1 Introduction

The split-\(\widehat{R}\) statistic and the effective sample size (abbreviated as ESS or \(S_{\rm eff}\); previously called \(N_{\rm eff}\) or \(n_{\rm eff}\)) are routinely used to monitor the convergence of iterative simulations, which are omnipresent in Bayesian statistics in the form of Markov chain Monte Carlo (MCMC) samples. The original \(\widehat{R}\) statistic (Gelman and Rubin, 1992; Brooks and Gelman, 1998) and split-\(\widehat{R}\) (Gelman et al., 2013) are both based on the ratio of between- and within-chain marginal variances of the simulations, with the latter computed from split chains (hence the name).

\(\widehat{R}\), split-\(\widehat{R}\), and \(S_{\rm eff}\) are well defined only if the marginal distributions have finite mean and variance. Even when that is the case, their estimates are less stable for long-tailed distributions. To alleviate these problems, we define split-\(\widehat{R}\) and \(S_{\rm eff}\) using rank normalized values, empirical cumulative distribution functions, and small posterior intervals.

The code for the new split-\(\widehat{R}\) and \(S_{\rm eff}\) versions and for the corresponding diagnostic plots can be found in monitornew.R and monitorplot.R, respectively.

2 Review of split-\(\widehat{R}\) and effective sample size

In this section, we will review the split-\(\widehat{R}\) and effective sample size estimates as implemented in Stan 2.18 (Stan Development Team, 2018e). These implementations represent the current de facto standard of convergence diagnostics for iterative simulations.

2.1 Split-\(\widehat{R}\)

Below, we present the computation of split-\(\widehat{R}\) following Gelman et al. (2013), but using the notation style of Stan Development Team (2018a). Our general recommendation is to always run several chains. Let \(N\) be the number of draws per chain, \(M\) the number of chains, and \(S=MN\) the total number of draws from all chains. For each scalar summary of interest \(\theta\), we compute \(B\) and \(W\), the between- and within-chain variances:

\[ B = \frac{N}{M-1}\sum_{m=1}^{M}(\overline{\theta}^{(.m)}-\overline{\theta}^{(..)})^2, \;\mbox{ where }\;\;\overline{\theta}^{(.m)}=\frac{1}{N}\sum_{n=1}^N \theta^{(nm)},\;\; \;\;\overline{\theta}^{(..)} = \frac{1}{M}\sum_{m=1}^M\overline{\theta}^{(.m)} \\ W = \frac{1}{M}\sum_{m=1}^{M}s_m^2, \;\mbox{ where }\;\; s_m^2=\frac{1}{N-1} \sum_{n=1}^N (\theta^{(nm)}-\overline{\theta}^{(.m)})^2. \]

The between-chain variance, \(B\), also contains the factor \(N\) because it is based on the variance of the within-chain means, \(\overline{\theta}^{(.m)}\), each of which is an average of \(N\) values \(\theta^{(nm)}\).

We can estimate \(\mbox{var}(\theta \mid y)\), the marginal posterior variance of the estimand, by a weighted average of \(W\) and \(B\), namely \[ \widehat{\mbox{var}}^+(\theta \mid y)=\frac{N-1}{N}W + \frac{1}{N}B. \] This quantity overestimates the marginal posterior variance assuming the starting distribution of the simulations is appropriately overdispersed compared to the target distribution, but is unbiased under stationarity (that is, if the starting distribution equals the target distribution), or in the limit \(N\rightarrow\infty\). To have an overdispersed starting distribution, independent Markov chains should be initialized with diffuse starting values for the parameters.

Meanwhile, for any finite \(N\), the within-chain variance \(W\) should underestimate \(\mbox{var}(\theta \mid y)\) because the individual chains haven’t had the time to explore all of the target distribution and, as a result, will have less variability. In the limit as \(N\rightarrow\infty\), the expectation of \(W\) also approaches \(\mbox{var}(\theta \mid y)\).

We monitor convergence of the iterative simulations to the target distribution by estimating the factor by which the scale of the current distribution for \(\theta\) might be reduced if the simulations were continued in the limit \(N\rightarrow\infty\). This potential scale reduction is estimated as \[ \widehat{R}= \sqrt{\frac{\widehat{\mbox{var}}^+(\theta \mid y)}{W}}, \] which declines to 1 as \(N\rightarrow\infty\). We call this split-\(\widehat{R}\) because we are applying it to chains that have been split in half so that \(M\) is twice the number of actual chains. Without splitting, \(\widehat{R}\) would get fooled by non-stationary chains (see Appendix D).
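
The computation above can be sketched in a few lines of R; `split_rhat` is a hypothetical helper (not Stan's implementation), assuming `sims` is a matrix of post-warmup draws with one column per chain.

```r
# Minimal sketch of split-Rhat, assuming `sims` is an N x M matrix of
# post-warmup draws (rows = iterations, columns = chains).
split_rhat <- function(sims) {
  N <- floor(nrow(sims) / 2)
  # Split each chain in half so M becomes twice the number of chains.
  half <- cbind(sims[1:N, , drop = FALSE],
                sims[(nrow(sims) - N + 1):nrow(sims), , drop = FALSE])
  chain_means <- colMeans(half)
  B <- N * var(chain_means)            # between-chain variance
  W <- mean(apply(half, 2, var))       # within-chain variance
  var_plus <- (N - 1) / N * W + B / N  # overestimate of marginal variance
  sqrt(var_plus / W)
}
```

For stationary, well-mixed chains the result is close to 1; a chain with a different mean or a non-stationary trend inflates \(B\) and hence \(\widehat{R}\).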

We note that split-\(\widehat{R}\) is also well defined for sequences that are not Markov chains. However, for simplicity, we always refer to ‘chains’ rather than, more generally, ‘sequences’, as the former is our primary use case for \(\widehat{R}\)-like measures.

2.2 Effective sample size

If the \(N\) simulation draws within each chain were truly independent, the between-chain variance \(B\) would be an unbiased estimate of the posterior variance, \(\mbox{var}(\theta \mid y)\), and we would have a total of \(S = MN\) independent simulations from the \(M\) chains. In general, however, the simulations of \(\theta\) within each chain will be autocorrelated, and thus \(B\) will be larger than \(\mbox{var}(\theta \mid y)\), in expectation.

A nice introductory reference for analyzing MCMC results in general and effective sample size in particular is provided by Geyer (2011, see also 1992). The particular calculations used by Stan (Stan Development Team, 2018e) follow those for split-\(\widehat{R}\), which involve both between-chain (mean) and within-chain calculations (autocorrelation). They were introduced in the Stan manual (Stan Development Team, 2018d) and explained in more detail in Gelman et al. (2013).

One way to define the effective sample size for correlated simulation draws is via the statistical efficiency of the average of the simulations, \(\bar{\theta}^{(..)}\), as an estimate of the posterior mean \(\mbox{E}(\theta \mid y)\). This also generalizes to posterior expectations of functions of the parameters, \(\mbox{E}(g(\theta) \mid y)\); we return later to estimating the effective sample size of quantiles, which cannot be presented as expectations. For simplicity, in this section we consider the effective sample size for the posterior mean.

The effective sample size of a chain is defined in terms of the autocorrelations within the chain at different lags. The autocorrelation \(\rho_t\) at lag \(t \geq 0\) for a chain with joint probability function \(p(\theta)\) with mean \(\mu\) and variance \(\sigma^2\) is defined to be \[ \rho_t = \frac{1}{\sigma^2} \, \int_{\Theta} (\theta^{(n)} - \mu) (\theta^{(n+t)} - \mu) \, p(\theta) \, d\theta. \] This is just the correlation between the two chains offset by \(t\) positions. Because \(\theta^{(n)}\) and \(\theta^{(n+t)}\) have the same marginal distribution in an MCMC setting, expanding the product and simplifying yields \[ \rho_t = \frac{1}{\sigma^2} \left( \int_{\Theta} \theta^{(n)} \, \theta^{(n+t)} \, p(\theta) \, d\theta \;-\; \mu^2 \right). \]

The effective sample size of one chain generated by a process with autocorrelations \(\rho_t\) is defined by \[ N_{\rm eff} \ = \ \frac{N}{\sum_{t = -\infty}^{\infty} \rho_t} \ = \ \frac{N}{1 + 2 \sum_{t = 1}^{\infty} \rho_t}. \]

The effective sample size \(N_{\rm eff}\) can be larger than \(N\) in the case of antithetic Markov chains, which have negative autocorrelations at odd lags. The dynamic Hamiltonian Monte Carlo algorithms used in Stan (Hoffman and Gelman, 2014; Betancourt, 2017) can produce \(N_{\rm eff}>N\) for parameters with a close-to-Gaussian posterior (in the unconstrained space) and weak dependence on other parameters.

2.2.1 Estimation of the Effective Sample Size

In practice, the probability function in question cannot be tractably integrated, and thus neither the autocorrelations nor the effective sample size can be computed exactly. Instead, these quantities must be estimated from the samples themselves. The rest of this section describes an effective sample size estimator based on autocorrelations and split-\(\widehat{R}\), computed from multiple split chains. For simplicity, each chain is assumed to have the same length \(N\).

Stan carries out the autocorrelation computations for all lags simultaneously using Eigen’s fast Fourier transform (FFT) package with appropriate padding; see Geyer (2011) for more details on using the FFT for autocorrelation calculations. The autocorrelation estimates \(\hat{\rho}_{t,m}\) at lag \(t\) from multiple chains \(m \in (1,\ldots,M)\) are combined with the within-chain variance estimate \(W\) and the multi-chain variance estimate \(\widehat{\mbox{var}}^{+}\) introduced in the previous section to compute the combined autocorrelation at lag \(t\) as \[ \hat{\rho}_t = 1 - \frac{\displaystyle W - \textstyle \frac{1}{M}\sum_{m=1}^M s_m^2 \hat{\rho}_{t,m}}{\widehat{\mbox{var}}^{+}}, \label{rhohat} \] where \(s_m^2\) is the within-chain variance of chain \(m\) as before. If the chains have not converged, the variance estimator \(\widehat{\mbox{var}}^{+}\) will overestimate the true marginal variance, which leads to an overestimation of the autocorrelation and an underestimation of the effective sample size.

Because the noise in the correlation estimates \(\hat{\rho}_t\) increases with the lag \(t\), a truncated sum of \(\hat{\rho}_t\) is typically used. Negative autocorrelations can happen only on odd lags, and by summing over pairs starting from lag \(t=0\), the paired autocorrelation is guaranteed to be positive, monotone, and convex modulo estimator noise (Geyer, 1992, 2011). Stan 2.18 uses Geyer’s initial monotone sequence criterion. The effective sample size of the combined chains is defined as \[ S_{\rm eff} = \frac{N \, M}{\hat{\tau}}, \] where \[ \hat{\tau} = 1 + 2 \sum_{t=1}^{2k+1} \hat{\rho}_t = -1 + 2 \sum_{t'=0}^{k} \hat{P}_{t'}, \] and \(\hat{P}_{t'}=\hat{\rho}_{2t'}+\hat{\rho}_{2t'+1}\). The initial positive sequence estimator is obtained by choosing the largest \(k\) such that \(\hat{P}_{t'}>0\) for all \(t' = 1,\ldots,k\). The initial monotone sequence estimator further reduces each \(\hat{P}_{t'}\) to the minimum of the preceding values, so that the estimated sequence becomes monotone.
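
The truncation rules above can be sketched in R as follows; `ess_geyer` is a hypothetical helper that estimates the autocovariances with R's `acf()` rather than Stan's FFT-based implementation, and omits edge-case handling.

```r
# Sketch of the combined-chain ESS with Geyer's initial positive and monotone
# sequence criteria, assuming `sims` is an N x M matrix of draws.
ess_geyer <- function(sims) {
  N <- nrow(sims); M <- ncol(sims)
  W <- mean(apply(sims, 2, var))
  B <- N * var(colMeans(sims))
  var_plus <- (N - 1) / N * W + B / N
  # Per-chain autocovariances (s_m^2 * rho_{t,m}) at lags 0..N-1, averaged.
  acov <- sapply(seq_len(M), function(m)
    acf(sims[, m], lag.max = N - 1, type = "covariance",
        plot = FALSE)$acf[, 1, 1])
  rho <- 1 - (W - rowMeans(acov)) / var_plus  # combined autocorrelations
  # Pair the autocorrelations: P_t' = rho_{2t'} + rho_{2t'+1}.
  npairs <- floor(length(rho) / 2)
  P <- rho[2 * seq_len(npairs) - 1] + rho[2 * seq_len(npairs)]
  # Initial positive sequence: keep pairs before the first non-positive one.
  first_nonpos <- which(P <= 0)[1]
  if (!is.na(first_nonpos)) P <- P[seq_len(first_nonpos - 1)]
  P <- cummin(P)             # initial monotone sequence
  tau <- -1 + 2 * sum(P)
  N * M / tau
}
```

For independent draws the estimate is close to \(NM\); for strongly autocorrelated chains it is much smaller.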

The effective sample size \(S_{\rm eff}\) described here is different from similar formulas in the literature in that we use multiple chains and between-chain variance in the computation, which typically gives us more conservative claims (lower values of \(S_{\rm eff}\)) compared to single chain estimates, especially when mixing of the chains is poor. If the chains are not mixing at all (e.g., the posterior is multimodal and the chains are stuck in different modes), then our \(S_{\rm eff}\) is close to the number of chains.

Before version 2.18, Stan used a slightly incorrect initial sequence, which implied that \(S_{\rm eff}\) was capped at \(S\) and thus the effective sample size was underestimated for some models. As antithetic behavior (i.e., \(S_{\rm eff} > S\)) is not that common, the effect is usually small, and capping at \(S\) can be considered pessimistic, this had a negligible effect on inference. However, it may have led to an underestimation of Stan’s efficiency when compared to other packages performing MCMC sampling.

3 Rank normalized split-\(\widehat{R}\) and effective sample size

As split-\(\widehat{R}\) and \(S_{\rm eff}\) are well defined only if the marginal posteriors have finite mean and variance, we next introduce versions of split-\(\widehat{R}\) and \(S_{\rm eff}\) based on rank normalized values, empirical cumulative distribution functions, and small posterior intervals, which are well defined for all distributions and more robust for long-tailed distributions.

3.1 Rank normalized split-\(\widehat{R}\)

Rank normalized split-\(\widehat{R}\) is computed using the equations in Section Split-\(\widehat{R}\) by replacing the original parameter values \(\theta^{(nm)}\) with their corresponding rank normalized values \(z^{(nm)}\).

Rank normalization:

  1. Rank: Replace each value \(\theta^{(nm)}\) by its rank \(r^{(nm)}\). Average ranks are used for ties to preserve the number of unique values of discrete quantities. Ranks are computed jointly for all draws from all chains.
  2. Normalize: Transform the ranks via the inverse normal transformation \(z^{(nm)} = \Phi^{-1}((r^{(nm)}-1/2)/S)\).

Appendix B illustrates the rank normalization of multiple chains.
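
In R, the two steps above amount to a call to `rank()` followed by `qnorm()`; `rank_normalize` is a hypothetical helper operating on an \(N \times M\) matrix of draws.

```r
# Rank normalization following the two steps above; ranks are computed
# jointly over all S = N * M draws, with average ranks for ties.
rank_normalize <- function(draws) {
  S <- length(draws)
  r <- rank(as.vector(draws), ties.method = "average")  # step 1: rank
  z <- qnorm((r - 1/2) / S)                             # step 2: inverse normal
  array(z, dim = dim(draws))                            # keep N x M layout
}
```

The output has mean zero and standard deviation close to one regardless of how heavy-tailed the input is.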

For continuous variables, as \(S \rightarrow \infty\), the rank normalized values are normally distributed; rank normalization thus performs a non-parametric transformation to normality. Using normalized ranks instead of the ranks directly has the benefit that the behavior of \(\widehat{R}\) and \(S_{\rm eff}\) does not change from that presented in Section Split-\(\widehat{R}\) for normally distributed \(\theta\).

3.2 Rank normalized folded-split-\(\widehat{R}\)

Both the original and rank-normalized split-\(\widehat{R}\) can be fooled if chains have different scales but the same location (see Appendix D). To alleviate this problem, we propose to compute rank normalized folded-split-\(\widehat{R}\) on folded split chains, that is, by rank normalizing the absolute deviations from the median, \[ {\rm abs}(\theta^{(nm)}-{\rm median}(\theta)). \]
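
A minimal sketch of the folding step in R; `folded_z` is a hypothetical helper, and split-\(\widehat{R}\) computed on its output gives the rank normalized folded-split-\(\widehat{R}\).

```r
# Fold the draws around the overall median, then rank normalize; `draws`
# is an N x M matrix. Chains with a larger scale get larger folded values
# and hence larger ranks, which the subsequent Rhat computation detects.
folded_z <- function(draws) {
  folded <- abs(draws - median(draws))              # fold around the median
  r <- rank(as.vector(folded), ties.method = "average")
  z <- qnorm((r - 1/2) / length(folded))            # inverse normal transform
  array(z, dim = dim(draws))
}
```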

To obtain a single conservative \(\widehat{R}\) estimate, we propose to report the maximum of rank normalized split-\(\widehat{R}\) and rank normalized folded-split-\(\widehat{R}\) for each parameter.

3.3 Effective sample size using rank normalized values

In addition to using rank-normalized values for convergence diagnostics via \(\widehat{R}\), we can compute the corresponding effective sample size \(S_{\rm eff}\) using the equations in Section Effective sample size, replacing the parameter values \(\theta^{(nm)}\) with the rank normalized values \(z^{(nm)}\). This estimate is well defined even if the original distribution does not have finite mean and variance. It is not directly applicable to estimating the Monte Carlo error of the mean of the original values, but it provides an estimate of the mixing efficiency of the chains that is invariant under bijective transformations. For parameters with a close-to-normal distribution, the difference from using the original values is small. For parameters with a distribution far from normal, however, rank normalization can be seen as a near-optimal non-parametric transformation.

The effective sample size computed from rank normalized values is a useful measure of the efficiency of estimating the bulk (the mean and quantiles near the median) of the distribution, and as a shorthand we use the term bulk effective sample size (bulk-ESS). Bulk-ESS is also useful for diagnosing problems due to trends or different means of the chains (see Appendix D).

3.4 Efficiency estimates for the cumulative distribution function

The bulk-ESS introduced above is useful as an overall efficiency measure. Next, we introduce effective sample size estimates for the cumulative distribution function (CDF), which we later use to construct efficiency diagnostics for quantiles, including the tail-ESS, and for local small probability intervals.

Quantiles, and the posterior intervals derived from them, are commonly reported quantities that are easy to estimate from posterior draws. Estimating the efficiency of such quantile estimates thus has high practical relevance, in particular because the efficiency for tail quantiles is often lower than for the mean or median. The \(\alpha\)-quantile is defined as the parameter value \(\theta_\alpha\) for which \(p(\theta \leq \theta_\alpha) = \alpha\). An estimate \(\hat{\theta}_\alpha\) of \(\theta_\alpha\) can thus be obtained by finding the \(\alpha\)-quantile of the empirical CDF (ECDF) of the posterior draws \(\theta^{(s)}\). However, quantiles cannot be written as expectations, and thus the above equations for \(\widehat{R}\) and \(S_{\rm eff}\) are not directly applicable. We therefore first focus on the efficiency estimate for the cumulative probability \(p(\theta \leq \theta_\alpha)\) for different values of \(\theta_\alpha\).

For any \(\theta_\alpha\), the ECDF gives an estimate of the cumulative probability \[ p(\theta \leq \theta_\alpha) \approx \bar{I}_\alpha = \frac{1}{S}\sum_{s=1}^S I(\theta^{(s)} \leq\theta_\alpha), \] where \(I(\cdot)\) is the indicator function. The indicator function transforms the simulation draws to 0’s and 1’s, and thus the subsequent computations are invariant under bijective transformations. Efficiency estimates of the ECDF at any \(\theta_\alpha\) can now be obtained by applying rank normalization and the subsequent computations directly to the indicator function’s results. See Appendix C for an illustration of the variance of the ECDF.
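
The indicator transform can be sketched in R as follows; `ecdf_indicator` and `ecdf_estimate` are hypothetical helper names, and the subsequent ESS machinery is applied to the 0/1 output unchanged.

```r
# Map draws through the indicator I(theta <= theta_alpha); `draws` may be a
# vector or an N x M matrix, and the result keeps the same shape.
ecdf_indicator <- function(draws, theta_alpha) {
  (draws <= theta_alpha) * 1
}

# The mean of the indicators is the ECDF estimate of p(theta <= theta_alpha).
ecdf_estimate <- function(draws, theta_alpha) {
  mean(ecdf_indicator(draws, theta_alpha))
}
```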

3.5 Efficiency estimates for quantiles and tail-ESS

Assuming that we know the CDF to be a certain continuous function \(F\), smooth near the \(\alpha\)-quantile of interest, we could use the delta method to compute a variance estimate for \(F^{-1}(\bar{I}_\alpha)\). Although we usually do not know \(F\), the delta method reveals that the variance of the quantile estimate is the variance of \(\bar{I}_\alpha\) scaled by \(1/f(\theta_\alpha)^2\), where \(f\) is the (usually unknown) density; the relative efficiency therefore depends only on the efficiency of \(\bar{I}_\alpha\). Thus, we can use the effective sample size computed for the ECDF via the indicator function \(I(\theta^{(s)} \leq \theta_\alpha)\) also for the corresponding quantile estimates.

In order to summarize the efficiency in the distribution’s tails, we propose to compute the minimum of the effective sample sizes of the 5% and 95% quantiles. As a shorthand, we use the term tail effective sample size (tail-ESS). Tail-ESS is also useful for diagnosing problems due to different scales of the chains (see Appendix D).
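
A sketch of tail-ESS in R, where `crude_ess` is a deliberately simplified stand-in (using only the lag-1 autocorrelation) for the full multi-chain estimator in monitornew.R; both function names are hypothetical.

```r
# Crude single-sequence ESS from the lag-1 autocorrelation, for
# illustration only; the real estimator sums over many lags.
crude_ess <- function(x) {
  rho1 <- acf(x, lag.max = 1, plot = FALSE)$acf[2]
  length(x) * (1 - rho1) / (1 + rho1)
}

# Tail-ESS: minimum of the ESS estimates for the 5% and 95% quantiles,
# each computed from the corresponding 0/1 indicator sequence.
tail_ess <- function(draws, probs = c(0.05, 0.95)) {
  min(sapply(probs, function(p) {
    ind <- as.numeric(draws <= quantile(draws, p))
    crude_ess(ind)
  }))
}
```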

3.6 Efficiency estimates for median and MAD

Since the marginal posterior distributions might not have finite mean and variance, by default RStan (Stan Development Team, 2018c) and RStanARM (Stan Development Team, 2018b) report the median and the median absolute deviation (MAD) instead of the mean and standard error (SE). The median and MAD are well defined even when the marginal distribution does not have finite mean and variance. Since the median is just the 50% quantile, we can obtain an efficiency estimate for it as for any other quantile.

We can also compute the efficiency estimate for the MAD, by computing the efficiency estimate for the median of the absolute deviations from the median of all draws. These absolute deviations are the same folded values as defined above, \[ {\rm abs}(\theta^{(nm)}-{\rm median}(\theta)). \] The efficiency estimate for the MAD is thus obtained with the same approach as for the median (and other quantiles), but applied to the folded values also used in rank-normalized-folded-split-\(S_{\rm eff}\).

3.7 Monte Carlo error estimates for quantiles

Previously, Stan has reported Monte Carlo standard error estimates for the mean of a quantity. This is valid only if the corresponding marginal distribution has finite mean and variance; and even when valid, it may be easier and more robust to focus on the median and other quantiles instead.

The median, MAD, and quantiles are well defined even when the distribution does not have finite mean and variance, and they are asymptotically normal for continuous distributions with non-zero density in the relevant neighborhood. As the delta method for computing the variance would require explicit knowledge of the normalized posterior density, we propose the following alternative approach:

  1. Compute quantiles of the \({\rm Beta}(S_{\rm eff}\bar{I}_\alpha+1,\, S_{\rm eff}(1-\bar{I}_\alpha)+1)\) distribution. Using \(S_{\rm eff}\) rather than \(S\) as the count takes the efficiency of the posterior draws into account.
  2. Find indices in \(\{1,\ldots,S\}\) closest to the ranks of these quantiles. For example, for quantile \(Q\), find \(s = {\rm round(Q S)}\).
  3. Use the corresponding \(\theta^{(s)}\) from the list of sorted posterior draws as quantiles from the error distribution. These quantiles can be used to approximate the Monte Carlo standard error.
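
The three steps can be sketched in R as follows, assuming the Beta counts are scaled by the effective sample size; `mcse_quantile_interval` is a hypothetical helper, with `alpha` the cumulative probability \(\bar{I}_\alpha\) of the quantile of interest.

```r
# Monte Carlo error interval for an alpha-quantile estimate; `draws` is a
# vector of S draws, `ess` the effective sample size of the corresponding
# ECDF indicator, and `probs` the quantiles of the error distribution.
mcse_quantile_interval <- function(draws, alpha, ess, probs = c(0.05, 0.95)) {
  S <- length(draws)
  sorted <- sort(draws)
  # Step 1: quantiles of the Beta distribution with effective counts.
  a <- qbeta(probs, ess * alpha + 1, ess * (1 - alpha) + 1)
  # Step 2: nearest ranks for those Beta quantiles.
  s <- pmin(pmax(round(a * S), 1), S)
  # Step 3: the corresponding sorted draws approximate quantiles of the
  # error distribution of the alpha-quantile estimate.
  sorted[s]
}
```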

3.8 Efficiency estimate for small interval probability estimates

We can obtain more local efficiency estimates by considering small probability intervals. We propose to compute efficiency estimates for \[ \bar{I}_{\alpha,\delta} = p(\hat{Q}_\alpha < \theta \leq \hat{Q}_{\alpha+\delta}), \] where \(\hat{Q}_\alpha\) is an empirical \(\alpha\)-quantile, \(\delta=1/k\) is the length of the interval for some positive integer \(k\), and \(\alpha \in \{0,\delta,\ldots,1-\delta\}\) changes in steps of \(\delta\). Each interval contains \(S/k\) draws, and the efficiency measures the autocorrelation of an indicator function which is \(1\) when a value falls inside the specific interval and \(0\) otherwise. This gives us a local efficiency measure which does not depend on the shape of the distribution.
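
A sketch of the interval construction in R; `small_interval_indicators` is a hypothetical helper, and the ESS computation on each indicator sequence is not shown.

```r
# Split a vector of draws into k intervals of probability delta = 1/k using
# empirical quantiles, and return the 0/1 indicator sequence for each
# interval; local efficiency is then the ESS of each indicator sequence.
small_interval_indicators <- function(draws, k = 20) {
  breaks <- quantile(draws, probs = seq(0, 1, length.out = k + 1))
  breaks[1] <- -Inf  # include the smallest draw in the first interval
  lapply(seq_len(k), function(i)
    as.numeric(draws > breaks[i] & draws <= breaks[i + 1]))
}
```

Each draw falls in exactly one interval, so each indicator sequence has mean \(\approx 1/k\).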

3.9 Rank plots

In addition to using rank-normalized values to compute split-\(\widehat{R}\), we propose to use rank plots for each chain instead of trace plots. Rank plots are simply histograms of the ranked posterior draws (ranked over all chains), plotted separately for each chain. If the rank plots of all chains look similar, this indicates good mixing of the chains. Compared to trace plots, rank plots do not become compressed and unreadable as the chains get longer.
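
A minimal base-graphics sketch of rank plots; the actual proposal targets the bayesplot package, and `plot_ranks` is a hypothetical stand-alone illustration.

```r
# Rank plots: histograms of the joint ranks, one panel per chain; `draws`
# is an N x M matrix, and ranks are computed jointly over all chains.
plot_ranks <- function(draws, breaks = 20) {
  r <- array(rank(as.vector(draws), ties.method = "average"),
             dim = dim(draws))
  op <- par(mfrow = c(1, ncol(draws))); on.exit(par(op))
  for (m in seq_len(ncol(draws)))
    hist(r[, m], breaks = breaks, main = paste("chain", m), xlab = "rank")
  invisible(r)  # return the ranks for further inspection
}
```

For well-mixed chains each histogram is close to uniform; a chain whose ranks pile up at one end has a different location or scale than the others.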

3.10 Proposed changes in Stan

We propose that Stan switch:

  • from split-\(\widehat{R}\) to the maximum of rank-normalized-split-\(\widehat{R}\) and rank-normalized-folded-split-\(\widehat{R}\)
  • from the classic effective sample size estimate (\(N_{\rm eff}\)), which currently does not use chain splitting, to rank-normalized-split-\(S_{\rm eff}\) (bulk-ESS) and rank-normalized-folded-split-\(S_{\rm eff}\) (tail-ESS)

Justifications for the changes are:

  • Rank normalization makes \(\widehat{R}\) and effective sample size measures well defined for all distributions, invariant under bijective transformations, and more stable than their classical counterparts.
  • Adding folded versions of \(\widehat{R}\) and effective sample size estimates helps in detecting scale differences across chains.

In summary outputs, we propose to use Rhat to denote the new version as well. As \(n\) and \(N\) often refer to the number of observations, we propose the acronym ESS for the effective sample size.

3.11 Proposed additions to bayesplot

We propose to add to the bayesplot package:

  • Rank plots
  • Plots of efficiency estimates for quantiles
  • Plots of efficiency estimates for small probability intervals
  • Plots of changes in efficiency estimates with increasing number of draws

3.12 Warning thresholds

Based on the experiments presented in Appendices D-F, stricter convergence diagnostic and effective sample size warning limits could be used. We propose the following warning thresholds, although additional experiments would be useful:

  • Rhat > 1.01
  • ESS < 400

When running 4 chains, an effective sample size of 400 corresponds to an effective sample size of 50 for each of the 8 split chains, which we consider the minimum for reliable mean, variance, and autocorrelation estimates needed for the convergence diagnostics. We recommend running at least 4 chains to obtain reliable between-chain variance estimates for the convergence diagnostics.

Plots shown in the upcoming sections have dashed lines based on these thresholds (in continuous plots, a dashed line at 1.005 is drawn instead of 1.01, as larger values are usually rounded to 1.01 in our summaries).

4 Examples

In this section, we will go through some examples to demonstrate the usefulness of our proposed methods as well as the associated workflow in determining convergence. Appendices D-G contain more detailed analysis of different algorithm variants and further examples.

First, we load all the necessary R packages and additional functions.

library(tidyverse)
library(gridExtra)
library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
library(bayesplot)
theme_set(bayesplot::theme_default(base_family = "sans"))
library(rjags)
library(abind)
source('monitornew.R')
source('monitorplot.R')

4.1 Cauchy: A distribution with infinite mean and variance

The classic split-\(\widehat{R}\) is based on calculating within- and between-chain variances. If the marginal distribution of a chain is such that the variance is not finite, the classic split-\(\widehat{R}\) is not well justified. In this section, we use the Cauchy distribution as an example of such a distribution. Appendix E contains more detailed analysis of different algorithm variants and further Cauchy examples.

The following Cauchy models are from Michael Betancourt’s case study Fitting The Cauchy Distribution.

4.1.1 Nominal parameterization of Cauchy

The nominal Cauchy model with direct parameterization is as follows:

writeLines(readLines("cauchy_nom.stan"))
parameters {
  vector[50] x;
}

model {
  x ~ cauchy(0, 1);
}

generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}

4.1.1.1 Default Stan options

Run the nominal model:

fit_nom <- stan(file = 'cauchy_nom.stan', seed = 7878, refresh = 0)
Warning: There were 1233 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: Examine the pairs() plot to diagnose sampling problems

Treedepth exceedance and the Bayesian fraction of missing information are diagnostics specific to dynamic HMC (Betancourt, 2017). We get warnings about a very large number of post-warmup transitions that exceeded the maximum treedepth, which is likely due to the very long tails of the Cauchy distribution. All chains also have a low estimated Bayesian fraction of missing information, indicating slow mixing.

mon <- monitor(fit_nom)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):

           Q5     Q50     Q95    Mean      SD  Rhat Bulk_ESS Tail_ESS
x[1]   -5.738  -0.015   6.313   2.505  36.377 1.028     1181      393
x[2]   -5.828  -0.012   6.067   0.648  16.217 1.007     2645      502
x[3]   -5.235   0.009   5.729   0.575  17.723 1.012     2683      823
x[4]   -6.250  -0.020   6.897   0.163  11.487 1.010     3627      644
x[5]   -9.663  -0.048   5.109  -1.413  11.271 1.009      629      156
x[6]   -5.257  -0.003   5.406   0.197   5.318 1.011     3060      883
x[7]   -6.354   0.059  10.772   4.144  32.565 1.017      607      184
x[8]   -6.445  -0.012   5.375  -0.219   7.662 1.004     2658      886
x[9]   -6.531  -0.004   6.295   0.152   7.382 1.006     3128      901
x[10]  -6.122  -0.011   5.916   0.056   5.743 1.006     2421      642
x[11]  -6.728   0.001   6.150   0.034   9.682 1.008     2079      600
x[12]  -5.678  -0.033   4.883   0.193  17.526 1.003     2633      774
x[13]  -4.525  -0.055   4.269   0.090   6.283 1.003     3148      811
x[14]  -4.880  -0.002   5.028  -0.023   5.931 1.004     1461      492
x[15] -14.490  -0.006  11.514  -1.411  22.739 1.032      486      160
x[16]  -7.028   0.013   6.959   0.162  15.913 1.014     2329      463
x[17]  -6.585   0.012   7.691   0.964  30.185 1.007     2292      446
x[18]  -4.505   0.054   6.915   1.099  11.893 1.015     2640      447
x[19]  -7.659  -0.047   6.079  -3.141  28.144 1.007     1147      298
x[20]  -5.782   0.026  11.335   5.777  36.989 1.029      363       80
x[21]  -4.894   0.022   5.379   0.105   5.453 1.008     3276      824
x[22]  -5.594   0.042   5.372   0.493  15.375 1.010     2121      522
x[23] -14.448   0.008   6.995  -3.096  32.020 1.012      391       89
x[24]  -5.932  -0.036   5.471  -1.007  16.613 1.015     1434      284
x[25]  -7.068  -0.023   5.944  -1.758  21.257 1.009     1544      324
x[26]  -8.962  -0.062   5.847  -1.970  17.357 1.011     1778      452
x[27]  -8.813  -0.006   8.345   0.153  18.079 1.005     1816      352
x[28]  -5.256   0.024   5.927   0.105   9.511 1.017     3776      675
x[29]  -5.881  -0.002   5.895  -0.036  18.470 1.009     3642      846
x[30]  -5.572  -0.016   5.141  -0.185   6.221 1.001     4363      643
x[31]  -7.360   0.014   7.255   0.069   8.715 1.002     3384      896
x[32]  -8.850  -0.038   6.497 -10.220  81.180 1.012      561      141
x[33]  -4.787   0.031   4.959  -0.348   8.611 1.007     2626      735
x[34]  -5.906  -0.034   5.372  -0.108   5.956 1.006     2408      634
x[35]  -6.103   0.017   6.790  -0.328  14.648 1.010     2654      630
x[36]  -4.861   0.097  15.598  19.301 108.424 1.044      155       34
x[37]  -8.932  -0.001   7.856  -0.090  13.300 1.006     1155      427
x[38]  -5.543  -0.023   4.830  -0.327   5.950 1.003     3353      576
x[39]  -5.895  -0.027   5.836   0.107   7.294 1.016     4493      724
x[40]  -8.707   0.012   6.715  -1.978  19.806 1.003      877      148
x[41]  -6.884  -0.027   7.548  -0.270  11.284 1.003     1726      512
x[42]  -6.155   0.014   6.480   0.091  15.506 1.014     1519      448
x[43]  -6.385   0.000   7.392   0.113  12.102 1.009     2746      483
x[44]  -7.843   0.005   7.992   0.414  11.498 1.008     2999      659
x[45]  -4.766  -0.020   4.662   0.036   6.726 1.007     2904     1014
x[46]  -4.840   0.004   5.986   1.161  22.071 1.004      955      338
x[47]  -8.507   0.034  24.263   0.796  37.140 1.071      231       56
x[48]  -6.734   0.001   5.331  -0.717  10.349 1.008     1907      469
x[49]  -6.272   0.044   6.919   0.730  12.441 1.004     1490      390
x[50]  -5.211   0.001   4.645  -0.221   7.078 1.001     3109      680
I       0.000   0.500   1.000   0.500   0.500 1.000      390     4000
lp__  -92.706 -69.208 -49.032 -69.536  13.349 1.054      117      323

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (good values is 
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
which_min_ess <- which.min(mon[1:50, 'Tail_ESS'])

Several values of Rhat > 1.01 and some ESS < 400 also indicate that the results should not be trusted. Appendix E has further results with longer chains as well.

We can further analyze potential problems using local efficiency and rank plots. We specifically investigate x[36], which has the smallest tail-ESS of 34.

We examine the sampling efficiency in different parts of the posterior by computing the efficiency of small interval probability estimates (see Section Efficiency estimate for small interval probability estimates). Each interval contains \(1/k\) of the draws (e.g., \(5\%\) each if \(k=20\)). The small interval efficiency measures the mixing of a function indicating whether the values are inside or outside the specific small interval. As detailed above, this gives us a local efficiency measure which does not depend on the shape of the distribution.

plot_local_ess(fit = fit_nom, par = which_min_ess, nalpha = 20)

We see that the efficiency of our posterior draws is worryingly low in the tails (which is caused by slow mixing in long tails of Cauchy). Orange ticks show iterations that exceeded the maximum treedepth.

An alternative way to examine the efficiency in different parts of the posterior is to compute efficiency estimates for quantiles (see Section Efficiency estimates for quantiles and tail-ESS). Each interval has a specified proportion of the draws, and the efficiency measures the mixing of a function indicating whether the values are smaller than or equal to the corresponding quantile.

plot_quantile_ess(fit = fit_nom, par = which_min_ess, nalpha = 40)

Similar as above, we see that the efficiency of our posterior draws is worryingly low in the tails. Again, orange ticks show iterations that exceeded the maximum treedepth.

We may also investigate how the estimated effective sample sizes change as we use more and more draws (Brooks and Gelman (1998) proposed a similar graph for \(\widehat{R}\)). If the effective sample size is highly unstable, does not increase proportionally with more draws, or even decreases, this indicates that simply running longer chains will likely not solve the convergence issues. In the plot below, we see how unstable both bulk-ESS and tail-ESS are for this example.

plot_change_ess(fit = fit_nom, par = which_min_ess)

We can further analyze potential problems using rank plots, in which we clearly see differences between chains.

samp <- as.array(fit_nom)
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])

4.1.2 Alternative parameterization of Cauchy

Next we examine an alternative parameterization that represents the Cauchy distribution as a scale mixture of Gaussians. The model has two parameter vectors, from which the Cauchy distributed \(x\)’s can be computed. In addition to improved sampling performance, the example illustrates that it matters which quantities we monitor.

writeLines(readLines("cauchy_alt_1.stan"))
parameters {
  vector[50] x_a;
  vector<lower=0>[50] x_b;
}

transformed parameters {
  vector[50] x = x_a ./ sqrt(x_b);
}

model {
  x_a ~ normal(0, 1);
  x_b ~ gamma(0.5, 0.5);
}

generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}
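
The underlying identity can be checked numerically outside Stan: Gamma(0.5, 0.5) is the chi-squared distribution with one degree of freedom, so x_a / sqrt(x_b) with x_a ~ N(0, 1) reduces to a ratio of independent standard normals, which is standard Cauchy. A hedged Python sketch:

```python
import random

random.seed(3)
n = 100_000
# x_b ~ Gamma(0.5, 0.5) equals Z^2 for Z ~ N(0, 1), so
# x_a / sqrt(x_b) = x_a / |Z|, a ratio of independent standard normals
x = [random.gauss(0, 1) / abs(random.gauss(0, 1)) for _ in range(n)]
x.sort()
q25, q50, q75 = x[n // 4], x[n // 2], x[3 * n // 4]
# standard Cauchy has quartiles -1, 0, 1; the mean and sd would not stabilize
print(round(q25, 2), round(q50, 2), round(q75, 2))
```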

Run the alternative model:

fit_alt1 <- stan(file = 'cauchy_alt_1.stan', seed = 7878, refresh = 0)

There are no warnings, and the sampling is much faster.

mon <- monitor(fit_alt1)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):

             Q5     Q50     Q95     Mean       SD  Rhat Bulk_ESS Tail_ESS
x_a[1]   -1.648  -0.020   1.657   -0.015    0.992 1.001     4361     2961
x_a[2]   -1.666  -0.002   1.631   -0.012    1.008 1.001     4015     3167
x_a[3]   -1.662  -0.026   1.617   -0.010    0.997 1.001     4190     3013
x_a[4]   -1.625  -0.011   1.691    0.006    1.009 1.000     3626     2822
x_a[5]   -1.636  -0.007   1.578   -0.011    0.985 1.002     3447     3029
x_a[6]   -1.623  -0.009   1.558   -0.011    0.971 1.001     3975     2795
x_a[7]   -1.607   0.003   1.653    0.011    1.008 1.000     3919     2271
x_a[8]   -1.672  -0.028   1.604   -0.024    1.005 1.001     3736     3040
x_a[9]   -1.626  -0.037   1.549   -0.029    0.986 1.002     3852     3075
x_a[10]  -1.655  -0.033   1.738    0.000    1.022 1.002     3457     2810
x_a[11]  -1.555  -0.016   1.551   -0.013    0.966 1.002     3532     2822
x_a[12]  -1.638   0.002   1.690    0.009    1.004 1.001     3462     3036
x_a[13]  -1.590   0.012   1.635    0.008    0.989 1.000     3479     2764
x_a[14]  -1.718  -0.026   1.625   -0.029    1.013 1.002     3872     3179
x_a[15]  -1.650   0.015   1.697    0.023    1.018 1.000     3843     2855
x_a[16]  -1.675  -0.006   1.624   -0.012    1.009 1.001     3610     3047
x_a[17]  -1.631  -0.008   1.662   -0.007    0.989 1.001     4363     3076
x_a[18]  -1.654  -0.010   1.688   -0.008    1.005 1.001     4636     3088
x_a[19]  -1.613  -0.020   1.588   -0.023    0.975 1.000     4055     3037
x_a[20]  -1.643  -0.006   1.618    0.000    1.001 1.001     4340     2792
x_a[21]  -1.671   0.040   1.683    0.025    1.033 1.000     3628     2719
x_a[22]  -1.628  -0.013   1.645   -0.010    1.009 1.000     3976     2775
x_a[23]  -1.617   0.034   1.698    0.018    1.013 1.002     4266     2682
x_a[24]  -1.662   0.002   1.667   -0.001    1.000 1.002     3710     2889
x_a[25]  -1.636   0.016   1.663    0.004    0.996 1.001     4408     3062
x_a[26]  -1.647  -0.017   1.613   -0.005    1.003 1.000     3854     2674
x_a[27]  -1.663  -0.001   1.680    0.001    1.013 1.002     3331     2944
x_a[28]  -1.687   0.018   1.681    0.016    1.014 1.002     4055     2689
x_a[29]  -1.575   0.002   1.642    0.018    0.990 1.000     4097     3213
x_a[30]  -1.609  -0.005   1.644    0.011    0.988 1.000     3961     2900
x_a[31]  -1.648   0.013   1.605   -0.005    0.994 1.000     4097     3088
x_a[32]  -1.588  -0.006   1.590   -0.010    0.969 1.000     4189     3171
x_a[33]  -1.688  -0.010   1.770    0.004    1.049 1.001     3853     2594
x_a[34]  -1.684   0.009   1.691   -0.004    1.006 1.000     4012     2787
x_a[35]  -1.613   0.026   1.675    0.027    1.000 1.001     3920     2829
x_a[36]  -1.667   0.004   1.604   -0.007    0.988 1.001     4107     2797
x_a[37]  -1.687  -0.027   1.593   -0.018    1.000 1.002     3956     2942
x_a[38]  -1.666  -0.018   1.638   -0.016    1.018 1.002     4210     2919
x_a[39]  -1.655  -0.025   1.647   -0.016    1.016 1.000     4482     2869
x_a[40]  -1.629   0.003   1.618    0.000    0.999 1.003     4139     2848
x_a[41]  -1.684   0.016   1.614   -0.003    0.990 1.004     3963     2962
x_a[42]  -1.645   0.028   1.636    0.020    1.001 1.001     4161     3077
x_a[43]  -1.638  -0.030   1.633   -0.015    1.008 1.000     3808     2849
x_a[44]  -1.655  -0.033   1.630   -0.017    1.004 1.001     3743     2865
x_a[45]  -1.682   0.019   1.731    0.009    1.017 1.001     3783     2546
x_a[46]  -1.560   0.044   1.664    0.049    0.970 1.000     4340     3277
x_a[47]  -1.660   0.018   1.583   -0.004    0.996 1.000     4283     2946
x_a[48]  -1.607   0.003   1.656    0.002    0.981 1.000     4001     2779
x_a[49]  -1.617   0.002   1.641    0.003    0.998 1.000     3906     3099
x_a[50]  -1.625  -0.006   1.583    0.000    0.987 1.001     3794     2962
x_b[1]    0.003   0.445   3.997    1.002    1.443 1.004     2268     1264
x_b[2]    0.004   0.456   3.752    0.996    1.380 1.002     2444     1428
x_b[3]    0.004   0.467   3.892    1.035    1.457 1.000     3578     1950
x_b[4]    0.004   0.456   3.830    1.010    1.406 1.001     2693     1342
x_b[5]    0.004   0.452   3.951    1.026    1.463 1.000     3056     1731
x_b[6]    0.004   0.440   3.803    1.001    1.412 1.001     3264     1786
x_b[7]    0.005   0.432   3.618    0.949    1.324 1.000     2888     1907
x_b[8]    0.004   0.443   3.792    1.008    1.403 1.000     2876     1578
x_b[9]    0.005   0.496   3.674    0.998    1.364 1.000     2820     1535
x_b[10]   0.003   0.441   3.789    0.987    1.392 1.000     2534     1699
x_b[11]   0.006   0.495   3.900    1.028    1.438 1.001     3595     2032
x_b[12]   0.004   0.440   3.822    0.987    1.401 1.002     2405     1200
x_b[13]   0.003   0.462   3.973    1.031    1.454 1.000     2045     1073
x_b[14]   0.004   0.441   3.970    1.032    1.486 1.000     2829     1443
x_b[15]   0.005   0.450   3.770    1.011    1.410 1.000     2853     1447
x_b[16]   0.004   0.477   3.794    1.006    1.426 1.000     2661     1604
x_b[17]   0.006   0.459   3.929    1.023    1.425 1.001     2775     1477
x_b[18]   0.006   0.505   4.106    1.065    1.534 1.001     2689     1170
x_b[19]   0.004   0.450   3.918    1.001    1.408 1.000     2392     1450
x_b[20]   0.003   0.425   3.964    0.985    1.423 1.003     2296     1240
x_b[21]   0.005   0.488   3.939    1.037    1.431 1.000     3069     1848
x_b[22]   0.004   0.451   3.946    1.016    1.455 1.000     3012     1733
x_b[23]   0.004   0.459   3.945    1.005    1.420 1.002     1787     1093
x_b[24]   0.003   0.436   3.862    1.002    1.428 1.000     1903     1008
x_b[25]   0.004   0.452   3.661    0.979    1.390 1.000     2348     1094
x_b[26]   0.005   0.472   4.046    1.029    1.457 1.001     2421     1549
x_b[27]   0.004   0.453   3.900    1.011    1.412 1.000     2777     1470
x_b[28]   0.004   0.456   3.792    0.984    1.368 1.000     3353     1699
x_b[29]   0.006   0.458   3.871    1.014    1.430 1.000     3428     1997
x_b[30]   0.004   0.437   3.885    1.009    1.430 1.000     2833     1554
x_b[31]   0.004   0.487   3.837    1.015    1.408 1.001     3035     1633
x_b[32]   0.003   0.432   3.753    0.966    1.359 1.001     2276     1602
x_b[33]   0.004   0.452   3.789    1.000    1.408 1.000     3093     1888
x_b[34]   0.004   0.468   3.970    1.026    1.452 1.000     3309     1650
x_b[35]   0.004   0.482   3.841    1.015    1.424 1.002     2493     1588
x_b[36]   0.005   0.443   3.894    0.988    1.395 1.000     3108     1876
x_b[37]   0.004   0.459   3.702    0.979    1.352 1.000     2644     1322
x_b[38]   0.005   0.454   3.930    1.012    1.449 1.001     3155     1776
x_b[39]   0.004   0.450   3.752    0.987    1.417 1.001     2038      934
x_b[40]   0.003   0.419   3.761    0.938    1.313 1.000     2657     1403
x_b[41]   0.005   0.461   3.784    0.998    1.379 1.000     2648     1370
x_b[42]   0.002   0.450   3.894    1.001    1.426 1.001     2334     1365
x_b[43]   0.004   0.470   4.030    1.032    1.443 1.000     2967     1797
x_b[44]   0.004   0.430   3.688    0.975    1.367 1.000     2557     1591
x_b[45]   0.005   0.442   3.656    0.962    1.300 1.001     2731     1785
x_b[46]   0.004   0.458   3.745    1.014    1.402 1.001     2538     1183
x_b[47]   0.007   0.472   3.833    1.007    1.392 1.001     3948     2071
x_b[48]   0.006   0.480   3.891    1.020    1.394 1.000     3207     1917
x_b[49]   0.005   0.474   3.737    0.999    1.347 1.001     2550     1533
x_b[50]   0.003   0.460   4.012    1.004    1.479 1.001     2881     1395
x[1]     -6.472  -0.019   6.520    0.011   34.878 1.006     3901     2122
x[2]     -6.516  -0.002   6.499    3.385  145.630 1.001     3767     1947
x[3]     -6.313  -0.030   5.948   -0.083   16.172 1.001     3681     2514
x[4]     -6.764  -0.013   5.864   -1.256   50.733 1.000     3244     2159
x[5]     -6.673  -0.012   5.755   -0.313   30.289 1.002     3355     2430
x[6]     -5.640  -0.013   6.476 -145.798 6461.971 1.000     3802     2536
x[7]     -6.593   0.003   6.375   -0.581   22.645 1.000     3480     2508
x[8]     -6.496  -0.035   6.294   -0.037   20.613 1.001     3474     2374
x[9]     -5.731  -0.044   5.961   -0.920   55.636 1.002     3493     2262
x[10]    -6.369  -0.042   6.430   -0.199   25.217 1.001     2871     2286
x[11]    -5.613  -0.020   5.549    0.120   12.361 1.001     3423     2587
x[12]    -7.122   0.002   6.187    0.407   91.826 1.003     3340     2329
x[13]    -6.444   0.020   6.216   -5.647  201.971 1.001     3332     2164
x[14]    -6.716  -0.040   6.564   -2.937  134.969 1.000     3693     2257
x[15]    -5.681   0.017   6.059   -0.864   30.387 1.001     3511     1969
x[16]    -6.987  -0.011   7.244   -0.402   24.589 1.002     3614     2406
x[17]    -5.778  -0.013   5.387   -0.063   25.139 1.001     3880     2520
x[18]    -5.450  -0.012   5.915   -0.550  259.421 1.002     4303     2254
x[19]    -7.157  -0.024   5.841    9.852  549.143 1.000     3505     1931
x[20]    -7.392  -0.008   6.677   -3.089  145.629 1.004     3411     1916
x[21]    -5.933   0.049   5.852    0.229   45.618 1.000     3419     2179
x[22]    -6.956  -0.019   6.226   -0.196   27.060 1.000     3554     2149
x[23]    -5.993   0.045   6.540    2.509  114.592 1.001     3456     2098
x[24]    -6.747   0.002   6.991   -0.672  111.425 1.003     3541     1922
x[25]    -6.326   0.018   6.398   -2.711  100.245 1.000     4036     2356
x[26]    -5.571  -0.021   6.319    0.195   14.470 1.001     3335     2319
x[27]    -6.401  -0.002   6.167   -0.784   52.601 1.001     3377     2400
x[28]    -5.698   0.023   6.490    3.105  105.247 1.000     3446     2121
x[29]    -5.403   0.001   5.672   -0.124   18.134 1.000     3918     2430
x[30]    -5.830  -0.005   6.453    0.998   51.195 1.000     3642     2138
x[31]    -5.852   0.017   5.660    0.864   26.311 1.002     3595     2271
x[32]    -6.487  -0.013   6.184    0.114   50.688 1.001     4227     2731
x[33]    -7.079  -0.008   6.339   -0.410   23.461 1.001     3807     2376
x[34]    -6.590   0.011   6.051    0.953   48.192 1.001     4084     2358
x[35]    -5.438   0.030   6.130    0.318   48.428 1.001     3756     2247
x[36]    -6.153   0.004   6.110   -0.057   28.418 1.000     3600     2386
x[37]    -6.076  -0.028   5.339    0.700   59.912 1.001     3623     2005
x[38]    -5.741  -0.021   5.767    0.200   13.158 1.003     3820     2536
x[39]    -6.734  -0.034   5.840    2.770  285.451 1.001     4121     1944
x[40]    -6.390   0.006   6.517   -2.748  102.743 1.000     3612     1823
x[41]    -6.012   0.022   5.923   -0.421   35.694 1.001     3492     2222
x[42]    -7.394   0.045   6.859    0.495   22.471 1.002     3558     1949
x[43]    -5.982  -0.034   6.685   14.356  626.203 1.000     3823     2516
x[44]    -7.035  -0.039   6.174    1.545  106.262 1.001     3310     2239
x[45]    -5.747   0.021   6.227   -0.427   32.277 1.001     3752     2437
x[46]    -5.593   0.060   6.331   -0.318   75.995 1.002     3898     1976
x[47]    -5.492   0.029   5.426   -0.021   12.082 1.001     3893     2659
x[48]    -5.963   0.003   4.850   -0.013   21.020 1.000     3674     2274
x[49]    -6.546   0.003   5.248   -1.235  129.152 1.002     3576     2243
x[50]    -6.735  -0.008   6.895   -1.061  147.106 1.001     3437     2486
I         0.000   0.000   1.000    0.499    0.500 1.002     2648     4000
lp__    -95.188 -80.991 -68.660  -81.342    8.076 1.002     1310     1928

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (good values is 
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).

which_min_ess <- which.min(mon[101:150, 'Tail_ESS'])

All Rhat < 1.01 and ESS > 400 indicate that the sampling worked much better with the alternative parameterization. Appendix E has more results using other alternative parameterizations. The x_a and x_b used to form the Cauchy distributed x have stable quantile, mean, and sd values. As x is Cauchy distributed, only its quantiles are stable, while the mean and sd estimates vary wildly because the true values are not finite.

We can further analyze potential problems using local efficiency estimates and rank plots. We take a detailed look at x[40], which has the smallest tail-ESS of 1823.

We examine the sampling efficiency in different parts of the posterior by computing efficiency estimates for small interval probabilities.

plot_local_ess(fit = fit_alt1, par = which_min_ess + 100, nalpha = 20)

The efficiency estimate is good in all parts of the posterior. Further, we examine the sampling efficiency of different quantile estimates.

plot_quantile_ess(fit = fit_alt1, par = which_min_ess + 100, nalpha = 40)

Rank plots also look rather similar across chains.

samp <- as.array(fit_alt1)
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])

In summary, the alternative parameterization produces results that look much better than for the nominal parameterization. There are still some differences in the tails, which we also identified via the tail-ESS.

4.1.3 Half-Cauchy with nominal parameterization

Half-Cauchy priors are common and, in Stan, are usually set using the nominal parameterization. However, when the constraint <lower=0> is used, Stan automatically samples in the unconstrained log(x) space, which fundamentally changes the geometry.
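
What this transformation does can be sketched outside Stan (a hedged Python illustration of the general lower-bound transform, not Stan's internal code): for <lower=0>, Stan samples y = log(x) and adds the log Jacobian to the target density.

```python
import math

def half_cauchy_lpdf(x):
    """Log density of half-Cauchy(0, 1) for x > 0."""
    return math.log(2.0 / math.pi) - math.log1p(x * x)

def unconstrained_lpdf(y):
    """Log density on the unconstrained scale: substitute x = exp(y)
    and add the log Jacobian log|dx/dy| = y."""
    return half_cauchy_lpdf(math.exp(y)) + y

# the unconstrained density has exponential tails on both sides,
# far lighter than the polynomial Cauchy tail on the constrained scale
print(round(unconstrained_lpdf(0.0), 3))
```

This lighter-tailed geometry is why the constrained half-Cauchy samples so much more easily than the unconstrained Cauchy of the previous section.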

writeLines(readLines("half_cauchy_nom.stan"))
parameters {
  vector<lower=0>[50] x;
}

model {
  x ~ cauchy(0, 1);
}

generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}

Run the half-Cauchy with nominal parameterization (and positive constraint):

fit_half_nom <- stan(file = 'half_cauchy_nom.stan', seed = 7878, refresh = 0)

There are no warnings, and the sampling is much faster than for the Cauchy nominal model.

mon <- monitor(fit_half_nom)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):

           Q5     Q50     Q95    Mean       SD  Rhat Bulk_ESS Tail_ESS
x[1]    0.081   1.033  13.559  11.119  388.184 1.004     8077     2223
x[2]    0.096   1.017  10.788  11.640  482.364 1.000     9868     2612
x[3]    0.063   1.002  12.855   5.403   71.621 1.000     7895     2097
x[4]    0.094   0.982  11.501   4.405   29.482 1.000     7596     2347
x[5]    0.076   1.034  14.015   4.518   19.853 1.000     7495     2230
x[6]    0.081   0.992  11.716  12.523  379.984 1.000     7948     2145
x[7]    0.073   0.978  12.022   9.227  280.919 1.000     8336     2117
x[8]    0.090   0.988  10.818   4.615   40.828 1.000     8194     2165
x[9]    0.090   1.020  11.555  11.672  483.091 1.001     8464     2127
x[10]   0.072   1.026  14.900   6.007   50.981 1.000     7963     2399
x[11]   0.075   0.973  12.544  15.546  532.285 1.002     6980     1788
x[12]   0.073   1.031  13.331   4.099   17.470 1.001     8226     2029
x[13]   0.077   1.012  14.110   8.881  120.144 1.003     8077     2032
x[14]   0.068   0.960  13.479   6.910   71.522 1.002     7455     2261
x[15]   0.070   0.996  14.426   8.658   95.803 1.003     6766     2246
x[16]   0.075   0.991  14.058   5.554   38.453 1.001     7396     2174
x[17]   0.082   0.986  11.704   8.699  299.762 1.001     7144     2260
x[18]   0.085   0.983  11.561  14.423  397.592 1.001     7614     2035
x[19]   0.073   0.988  14.315   6.334   69.060 1.001     7823     1825
x[20]   0.090   1.001  11.205   6.402  160.214 1.000     7905     2355
x[21]   0.073   0.997  12.011   7.028  144.132 0.999     7843     2078
x[22]   0.087   1.022  12.693  32.183 1733.463 1.000     7735     2043
x[23]   0.070   0.981  13.460   6.325   67.345 1.002     7119     2142
x[24]   0.073   0.996  12.310   4.985   41.550 1.002     6893     1982
x[25]   0.090   1.000  11.527   8.602  270.332 1.002     7757     2351
x[26]   0.085   0.985  10.679   6.968  119.110 1.001     6433     2230
x[27]   0.077   1.009  13.373   4.882   34.296 1.000     7005     2119
x[28]   0.081   0.972  11.410   5.757   66.701 1.002     9631     2228
x[29]   0.074   0.996  13.681  10.374  239.906 1.001     6109     2312
x[30]   0.089   1.025  10.986   7.981  206.413 1.002     7958     2368
x[31]   0.078   0.964  12.417   4.395   32.129 1.001     6493     2102
x[32]   0.097   1.012  11.460   4.987   55.377 1.002     7043     1742
x[33]   0.076   0.987  13.765   5.455   36.802 1.000     6913     2455
x[34]   0.077   0.981  12.378   5.922   78.208 1.002     8610     2514
x[35]   0.075   0.965  13.567   5.696   57.446 1.002     6406     2160
x[36]   0.064   1.003  13.540   5.365   39.600 1.001     7694     2031
x[37]   0.071   1.000  13.867   8.289  195.606 1.001     7276     2491
x[38]   0.083   1.018  12.235   6.399  119.056 1.002     6790     2369
x[39]   0.095   1.013  11.687   6.667   88.506 1.000     7739     2518
x[40]   0.092   0.963  12.061   5.158   44.414 1.000     7087     2349
x[41]   0.070   0.962  12.868   4.969   42.302 1.000     8650     2333
x[42]   0.071   1.020  13.331   7.770  132.343 1.004     8703     2410
x[43]   0.076   0.970  10.254  26.620 1404.418 1.003     8747     2194
x[44]   0.079   1.035  12.290   5.961   59.293 1.003     6378     2257
x[45]   0.080   0.980  12.219   7.703  123.023 1.001     8430     2314
x[46]   0.082   0.990  12.058   5.358   72.186 1.001     8185     2237
x[47]   0.084   1.009  14.500   7.003   76.721 1.000     9562     2265
x[48]   0.076   1.007  13.053   5.063   30.438 1.000     8402     2690
x[49]   0.079   0.999  12.928   7.479   99.983 1.001     7993     1804
x[50]   0.084   0.997  13.215   8.861  178.683 1.002     7523     2243
I       0.000   0.000   1.000   0.487    0.500 0.999     7357     4000
lp__  -80.633 -69.062 -59.314 -69.324    6.415 1.002     1218     2001

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (good values is 
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).

All Rhat < 1.01 and ESS > 400 indicate good performance of the sampler. We see that Stan’s automatic (and implicit) transformation of constrained parameters can have a big effect on sampling performance. More experiments with different parameterizations of the half-Cauchy distribution can be found in Appendix E.

4.2 Hierarchical model: Eight Schools

The Eight Schools data is a classic example for hierarchical models (see Section 5.5 in Gelman et al., 2013), which despite its apparent simplicity nicely illustrates the typical problems in inference for hierarchical models. The Stan models below are from Michael Betancourt’s case study on Diagnosing Biased Inference with Divergences. Appendix F contains a more detailed analysis of different algorithm variants.

4.2.1 A Centered Eight Schools model

writeLines(readLines("eight_schools_cp.stan"))
data {
  int<lower=0> J;
  real y[J];
  real<lower=0> sigma[J];
}

parameters {
  real mu;
  real<lower=0> tau;
  real theta[J];
}

model {
  mu ~ normal(0, 5);
  tau ~ cauchy(0, 5);
  theta ~ normal(mu, tau);
  y ~ normal(theta, sigma);
}

4.2.1.1 Centered Eight Schools model

We directly run the centered parameterization model with an increased adapt_delta value to reduce the probability of getting divergent transitions.

eight_schools <- read_rdump("eight_schools.data.R")
fit_cp <- stan(
  file = 'eight_schools_cp.stan', data = eight_schools,
  iter = 2000, chains = 4, seed = 483892929, refresh = 0,
  control = list(adapt_delta = 0.95)
)
Warning: There were 113 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: There were 2 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems

Despite the increased adapt_delta, we still observe many divergent transitions, which in itself is already a sufficient indicator not to trust the results. We can use Rhat and ESS diagnostics to recognize problematic parts of the posterior; these diagnostics are also applicable when MCMC algorithms other than HMC are used.

mon <- monitor(fit_cp)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):

              Q5     Q50    Q95    Mean    SD  Rhat Bulk_ESS Tail_ESS
mu        -1.113   4.531  9.895   4.441 3.386 1.017      548      754
tau        0.391   2.846  9.611   3.620 3.098 1.069       67       82
theta[1]  -2.238   5.814 16.293   6.226 5.739 1.016      747     1294
theta[2]  -2.601   5.067 13.433   5.077 4.860 1.012      970     1240
theta[3]  -5.013   4.355 12.111   3.938 5.332 1.009      899     1147
theta[4]  -2.858   4.999 12.816   4.893 4.825 1.010      986     1059
theta[5]  -4.742   4.030 10.829   3.661 4.808 1.013      715      988
theta[6]  -4.155   4.281 11.595   4.078 4.838 1.010      833      976
theta[7]  -1.299   5.974 15.589   6.311 5.184 1.019      612     1182
theta[8]  -3.371   5.117 13.844   4.975 5.338 1.010      901     1477
lp__     -24.682 -15.041  0.371 -14.005 7.438 1.068       69       89

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (good values is 
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).

See Appendix F for results of longer chains.

Bulk-ESS and Tail-ESS for the between school standard deviation tau are 67 and 82, respectively. Both are less than 400, indicating that we should investigate this parameter more carefully. We thus examine the sampling efficiency in different parts of the posterior by computing the efficiency estimates for small intervals of tau. These plots may show either quantiles or parameter values on the vertical axis. Red ticks show divergent transitions.

plot_local_ess(fit = fit_cp, par = "tau", nalpha = 20)

plot_local_ess(fit = fit_cp, par = "tau", nalpha = 20, rank = FALSE)

We see that the sampler has difficulties exploring small tau values. As the sampling efficiency for estimating small tau values is practically zero, we have to assume that we may be missing a substantial amount of posterior mass and thus obtain biased estimates. The red ticks, which show iterations with divergences, are concentrated at small tau values, also indicating problems in exploring small values, which is likely to cause bias.

We also examine the sampling efficiency of different quantile estimates. Again, these plots may show either quantiles or parameter values on the vertical axis.

plot_quantile_ess(fit = fit_cp, par = "tau", nalpha = 40)

plot_quantile_ess(fit = fit_cp, par = "tau", nalpha = 40, rank = FALSE)

Most of the quantile estimates have worryingly low effective sample size.

Let’s see how the estimated effective sample size changes as we use more and more draws. Here we don’t see sudden changes, but both bulk-ESS and tail-ESS are too low. See Appendix F for results with longer chains.

plot_change_ess(fit = fit_cp, par = "tau")

In line with these findings, the rank plots of tau clearly show problems in the mixing of the chains.

samp_cp <- as.array(fit_cp)
mcmc_hist_r_scale(samp_cp[, , "tau"])

4.2.2 Non-centered Eight Schools model

For hierarchical models, the non-centered parameterization often works better than the centered one:

writeLines(readLines("eight_schools_ncp.stan"))
data {
  int<lower=0> J;
  real y[J];
  real<lower=0> sigma[J];
}

parameters {
  real mu;
  real<lower=0> tau;
  real theta_tilde[J];
}

transformed parameters {
  real theta[J];
  for (j in 1:J)
    theta[j] = mu + tau * theta_tilde[j];
}

model {
  mu ~ normal(0, 5);
  tau ~ cauchy(0, 5);
  theta_tilde ~ normal(0, 1);
  y ~ normal(theta, sigma);
}

For reasons of comparability, we also run the non-centered parameterization model with an increased adapt_delta value:

fit_ncp2 <- stan(
  file = 'eight_schools_ncp.stan', data = eight_schools,
  iter = 2000, chains = 4, seed = 483892929, refresh = 0,
  control = list(adapt_delta = 0.95)
)

We get zero divergences and no other warnings, which is a good first sign.

mon <- monitor(fit_ncp2)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):

                    Q5    Q50    Q95   Mean    SD  Rhat Bulk_ESS Tail_ESS
mu              -1.136  4.418  9.960  4.466 3.367 1.000     5531     3004
tau              0.299  2.810  9.501  3.588 3.174 1.000     2872     1908
theta_tilde[1]  -1.295  0.307  1.880  0.307 0.979 1.003     5046     2874
theta_tilde[2]  -1.445  0.109  1.623  0.101 0.937 1.001     4177     2735
theta_tilde[3]  -1.644 -0.075  1.481 -0.078 0.966 1.001     6485     2994
theta_tilde[4]  -1.512  0.084  1.622  0.064 0.947 1.003     6076     2514
theta_tilde[5]  -1.710 -0.173  1.387 -0.162 0.939 1.002     5608     3177
theta_tilde[6]  -1.642 -0.060  1.528 -0.065 0.967 1.001     4855     2773
theta_tilde[7]  -1.252  0.396  1.867  0.367 0.955 1.001     4796     2849
theta_tilde[8]  -1.536  0.072  1.662  0.062 0.964 1.000     6142     2972
theta[1]        -1.619  5.639 16.310  6.246 5.576 1.003     4907     3015
theta[2]        -2.415  4.819 12.957  5.013 4.680 1.000     5122     3242
theta[3]        -4.631  4.261 12.105  4.090 5.270 1.000     5457     3407
theta[4]        -2.733  4.754 12.485  4.817 4.780 1.001     4695     3130
theta[5]        -3.956  3.820 11.023  3.717 4.623 1.001     5346     3398
theta[6]        -3.883  4.271 11.437  4.127 4.953 1.000     5670     3393
theta[7]        -0.979  5.880 15.429  6.375 5.101 1.000     4708     3242
theta[8]        -3.234  4.780 13.129  4.869 5.175 1.000     4924     3037
lp__           -11.193 -6.600 -3.776 -6.925 2.314 1.001     1641     2344

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (good values is 
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).

All Rhat < 1.01 and ESS > 400 indicate a much better performance of the non-centered parameterization.

We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates for tau.

plot_local_ess(fit = fit_ncp2, par = 2, nalpha = 20)

Small tau values are still more difficult to explore, but the relative efficiency is in a good range. We may also check this with a finer resolution:

plot_local_ess(fit = fit_ncp2, par = 2, nalpha = 50)

The sampling efficiency for different quantile estimates looks good as well.

plot_quantile_ess(fit = fit_ncp2, par = 2, nalpha = 40)

In line with these findings, the rank plots of tau show no substantial differences between chains.

samp_ncp2 <- as.array(fit_ncp2)
mcmc_hist_r_scale(samp_ncp2[, , 2])

References

Betancourt, M. (2017) ‘A conceptual introduction to Hamiltonian Monte Carlo’, arXiv preprint arXiv:1701.02434.

Brooks, S. P. and Gelman, A. (1998) ‘General methods for monitoring convergence of iterative simulations’, Journal of Computational and Graphical Statistics, 7(4), pp. 434–455.

Gelman, A. et al. (2013) Bayesian Data Analysis, third edition. CRC Press.

Gelman, A. and Rubin, D. B. (1992) ‘Inference from iterative simulation using multiple sequences’, Statistical Science, 7(4), pp. 457–472.

Geyer, C. J. (1992) ‘Practical Markov chain Monte Carlo’, Statistical Science, 7, pp. 473–483.

Geyer, C. J. (2011) ‘Introduction to Markov chain Monte Carlo’, in Brooks, S. et al. (eds) Handbook of Markov Chain Monte Carlo. CRC Press.

Hoffman, M. D. and Gelman, A. (2014) ‘The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo’, Journal of Machine Learning Research, 15, pp. 1593–1623. Available at: http://jmlr.org/papers/v15/hoffman14a.html.

Stan Development Team (2018a) Bayesian Statistics Using Stan. Stan Development Team. Available at: https://github.com/stan-dev/stan-book.

Stan Development Team (2018b) ‘RStanArm: Bayesian applied regression modeling via Stan. R package version 2.17.4’. Available at: http://mc-stan.org.

Stan Development Team (2018c) ‘RStan: The R interface to Stan. R package version 2.17.3’. Available at: http://mc-stan.org.

Stan Development Team (2018d) ‘Stan modeling language users guide and reference manual. Version 2.18.0’. Available at: http://mc-stan.org.

Stan Development Team (2018e) ‘The Stan core library version 2.18.0’. Available at: http://mc-stan.org.

Appendices

Appendix A: Abbreviations

The following abbreviations are used throughout the appendices:

  • N = total number of draws
  • Rhat = classic no-split-Rhat
  • sRhat = classic split-Rhat
  • zsRhat = rank-normalized split-Rhat
    • all chains are jointly ranked and z-transformed
    • can detect differences in location and trends
  • zfsRhat = rank-normalized folded split-Rhat
    • all chains are jointly “folded” by computing absolute deviation from median, ranked and z-transformed
    • can detect differences in scales
  • seff = no-split effective sample size
  • reff = seff / N
  • zsseff = rank-normalized split effective sample size
    • estimates the efficiency of mean estimate for rank normalized values
  • zsreff = zsseff / N
  • zfsseff = rank-normalized folded split effective sample size
    • estimates the efficiency of rank normalized mean absolute deviation
  • zfsreff = zfsseff / N
  • tailseff = minimum of rank-normalized split effective sample sizes of the 5% and 95% quantiles
  • tailreff = tailseff / N
  • medsseff = median split effective sample size
    • estimates the efficiency of the median
  • medsreff = medsseff / N
  • madsseff = mad split effective sample size
    • estimates the efficiency of the median absolute deviation
  • madsreff = madsseff / N

Appendix B: Examples of rank normalization

We will illustrate rank normalization with a few examples. First, we plot histograms and empirical cumulative distribution functions (ECDFs) with respect to the original parameter values (\(\theta\)), scaled ranks (ranks divided by the maximum rank), and rank normalized values (\(z\)). We use scaled ranks to make the plots look similar for different numbers of draws.
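
Concretely, the rank normalization used in these plots can be sketched as follows (a Python illustration; the (r − 3/8)/(S + 1/4) fractional-rank offset, Blom's formula, is assumed here, and the function name is illustrative, not part of monitornew.R):

```python
import random
from statistics import NormalDist

def rank_normalize(draws):
    """Replace each draw by the standard normal quantile of its
    fractional rank, using the (r - 3/8) / (S + 1/4) offset."""
    S = len(draws)
    order = sorted(range(S), key=lambda i: draws[i])
    rank = [0] * S
    for r, i in enumerate(order, start=1):
        rank[i] = r
    nd = NormalDist()
    return [nd.inv_cdf((r - 0.375) / (S + 0.25)) for r in rank]

# heavy-tailed draws become standard-normal-like after rank normalization
random.seed(6)
theta = [random.gauss(0, 1) / random.gauss(0, 1) for _ in range(1000)]  # Cauchy-like
z = rank_normalize(theta)
print(round(min(z), 2), round(max(z), 2))
```

However extreme the original draws, the rank normalized values stay within a few standard normal units, which is what makes the subsequent variance computations stable.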

100 draws from Normal(0, 1):

n <- 100
theta <- rnorm(n)
plot_ranknorm(theta, n)

100 draws from Exponential(1):

theta <- rexp(n)
plot_ranknorm(theta, n)

100 draws from Cauchy(0, 1):

theta <- rcauchy(n)
plot_ranknorm(theta, n)

In the above plots, the ECDFs with respect to scaled ranks and rank normalized \(z\)-values look exactly the same for all distributions. In split-\(\widehat{R}\) and effective sample size computations, we rank all draws jointly, but then compare the ranks and ECDFs of the individual split chains. To illustrate the variation between chains, we draw 8 batches of 100 draws each from Normal(0, 1):

n <- 100
m <- 8
theta <- rnorm(n * m)
plot_ranknorm(theta, n, m)

The variation in the ECDF due to the variation in ranks is now visible also in the scaled ranks and rank normalized \(z\)-values of the different batches (chains).

The benefit of rank normalization is more obvious for non-normal distributions such as the Cauchy:

theta <- rcauchy(n * m)
plot_ranknorm(theta, n, m)

Rank normalization makes the subsequent computations well defined and invariant under bijective transformations. This means that we get the same results, for example, if we use unconstrained or constrained parameterisations in a model.
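As a concrete sketch, rank normalization can be written in a few lines of R. The normal-scores offset \((r - 3/8)/(S + 1/4)\) is an assumption here (a common choice); the actual implementation is in monitornew.R.

```r
# Minimal sketch of rank normalization (z-transform of ranks).
# The 3/8 offset is a common normal-scores choice, assumed here;
# see monitornew.R for the actual implementation.
z_scale <- function(x) {
  r <- rank(x, ties.method = "average")       # rank all draws jointly
  z <- qnorm((r - 3/8) / (length(x) + 1/4))   # map ranks to normal quantiles
  if (!is.null(dim(x))) dim(z) <- dim(x)      # keep iterations x chains shape
  z
}
```

Because only the ranks enter, z_scale(x) and z_scale(atan(x)) give identical results, illustrating the invariance under monotone bijections.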

Appendix C: Variance of the cumulative distribution function

In Section 3, we defined the empirical CDF (ECDF) for any \(\theta_\alpha\) as \[ p(\theta \leq \theta_\alpha) \approx \bar{I}_\alpha = \frac{1}{S}\sum_{s=1}^S I(\theta^{(s)} \leq\theta_\alpha). \]

For independent draws, \(p(\theta \leq \theta_\alpha)\) has a \({\rm Beta}(S\bar{I}_\alpha+1, S - S\bar{I}_\alpha + 1)\) distribution. Thus we can easily examine the variation of the ECDF for any \(\theta_\alpha\) value from a single chain. If \(S\bar{I}_\alpha\) is not very close to \(0\) or \(S\) and \(S\) is large, we can use the variance of the Beta distribution

\[ {\rm Var}[p(\theta \leq \theta_\alpha)] = \frac{(S\bar{I}_\alpha+1)(S-S\bar{I}_\alpha+1)}{(S+2)^2(S+3)}. \] We illustrate uncertainty intervals of the Beta distribution and the normal approximation of the ECDF for 100 draws from Normal(0, 1):

n <- 100
m <- 1
theta <- rnorm(n * m)
plot_ranknorm(theta, n, m, interval = TRUE)

Uncertainty intervals of ECDF for draws from Cauchy(0, 1) illustrate again the improved visual clarity in plotting when using scaled ranks:

n <- 100
m <- 1
theta <- rcauchy(n * m)
plot_ranknorm(theta, n, m, interval = TRUE)

The above plots illustrate that the normal approximation is accurate for practical purposes in MCMC diagnostics.
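For a concrete feel of the numbers, the Beta interval and its normal approximation can be computed directly; the values of \(S\) and \(\bar{I}_\alpha\) below are hypothetical.

```r
# Hypothetical example: 95% interval for p(theta <= theta_alpha) from the
# Beta distribution of the ECDF versus its normal approximation.
S <- 400
I_bar <- 0.25                           # observed ECDF value at theta_alpha
a <- S * I_bar + 1                      # Beta shape parameters from the
b <- S - S * I_bar + 1                  # indicator counts
beta_int <- qbeta(c(0.025, 0.975), a, b)
m <- a / (a + b)                        # Beta mean
v <- a * b / ((a + b)^2 * (a + b + 1))  # Beta variance
norm_int <- m + c(-1, 1) * qnorm(0.975) * sqrt(v)
```

With \(S = 400\) the two intervals agree to roughly two decimal places, consistent with the accuracy seen in the plots above.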

Appendix D: Normal distributions with additional trend, shift or scaling

This part focuses on diagnostics for

  • all chains having a trend and a similar marginal distribution
  • one of the chains having a different mean
  • one of the chains having a lower marginal variance

To simplify, in this part, independent draws are used as a proxy for very efficient MCMC sampling. First, we sample draws from a standard normal distribution. We will discuss the behavior for non-normal distributions later. See Appendix A for the abbreviations used.

4.2.3 Adding the same trend to all chains

All chains are drawn from the same Normal(0, 1) distribution, with the same linear trend added to each chain:

conds <- expand.grid(
  iters = c(250, 1000, 4000), 
  trend = c(0, 0.25, 0.5, 0.75, 1),
  rep = 1:10
)
res <- vector("list", nrow(conds))
chains = 4
for (i in 1:nrow(conds)) {
  iters <- conds[i, "iters"]
  trend <- conds[i, "trend"]
  rep <- conds[i, "rep"]
  r <- array(rnorm(iters * chains), c(iters, chains))
  r <- r + seq(-trend, trend, length.out = iters)
  rs <- as.data.frame(monitor_extra(r))
  res[[i]] <- cbind(iters, trend, rep, rs)
}
res <- bind_rows(res)

If we don’t split the chains, Rhat misses trends as long as all chains still have similar marginal distributions.

ggplot(data = res, aes(y = Rhat, x = trend)) + 
  geom_point() + 
  geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = 1.005, linetype = 'dashed') + 
  geom_hline(yintercept = 1) + 
  ggtitle('Rhat without splitting chains')

Split-Rhat can detect trends, even if the marginals of the chains are similar.

ggplot(data = res, aes(y = zsRhat, x = trend)) + 
  geom_point() + geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = 1.005, linetype = 'dashed') + 
  geom_hline(yintercept = 1) + 
  ggtitle('Split-Rhat')

Result: Split-Rhat is useful for detecting non-stationarity (i.e., trends) in the chains. If we use a threshold of \(1.01\), we can detect trends which account for 2% or more of the total marginal variance. If we use a threshold of \(1.1\), we detect trends which account for 30% or more of the total marginal variance.

The effective sample size is based on split-Rhat and within-chain autocorrelation. We plot the relative efficiency \(R_{\rm eff}=S_{\rm eff}/S\) for easier comparison between different values of \(S\). In the plot below, dashed lines indicate the threshold at which we would consider the effective sample size to be sufficient (i.e., \(S_{\rm eff} > 400\)). Since we plot the relative efficiency instead of the effective sample size itself, this threshold is divided by \(S\), which we compute here as the number of iterations per chain (variable iters) times the number of chains (\(4\)).

ggplot(data = res, aes(y = zsreff, x = trend)) + 
  geom_point() + 
  geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = c(0,1)) + 
  geom_hline(aes(yintercept = 400 / (4 * iters)), linetype = 'dashed') + 
  ggtitle('Relative Bulk-ESS (zsreff)') + 
  scale_y_continuous(breaks = seq(0, 1.5, by = 0.25))

Result: Split-Rhat is more sensitive to trends for small sample sizes, but the effective sample size becomes more sensitive for larger sample sizes (as autocorrelations can be estimated more accurately).

Advice: If in doubt, run longer chains for more accurate convergence diagnostics.

4.2.4 Shifting one chain

Next we investigate the sensitivity to detect if one of the chains has not converged to the same distribution as the others, but has a different mean.

conds <- expand.grid(
  iters = c(250, 1000, 4000), 
  shift = c(0, 0.25, 0.5, 0.75, 1),
  rep = 1:10
)
res <- vector("list", nrow(conds))
chains = 4
for (i in 1:nrow(conds)) {
  iters <- conds[i, "iters"]
  shift <- conds[i, "shift"]
  rep <- conds[i, "rep"]
  r <- array(rnorm(iters * chains), c(iters, chains))
  r[, 1] <- r[, 1] + shift
  rs <- as.data.frame(monitor_extra(r))
  res[[i]] <- cbind(iters, shift, rep, rs)
}
res <- bind_rows(res)
ggplot(data = res, aes(y = zsRhat, x = shift)) + 
  geom_point() + 
  geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = 1.005, linetype = 'dashed') + 
  geom_hline(yintercept = 1) + 
  ggtitle('Split-Rhat')

Result: If we use a threshold of \(1.01\), we can detect shifts with a magnitude of one third or more of the marginal standard deviation. If we use a threshold of \(1.1\), we detect shifts with a magnitude equal to or larger than the marginal standard deviation.

ggplot(data = res, aes(y = zsreff, x = shift)) + 
  geom_point() +
  geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = c(0,1)) + 
  geom_hline(aes(yintercept = 400 / (4 * iters)), linetype = 'dashed') + 
  ggtitle('Relative Bulk-ESS (zsreff)') + 
  scale_y_continuous(breaks = seq(0, 1.5, by = 0.25))

Result: The effective sample size is not as sensitive, but a shift with a magnitude of half the marginal standard deviation or more will lead to very low relative efficiency when the total number of draws increases.

Rank plots can be used to visualize differences between chains. Here, we show rank plots for the case of 4 chains, 250 draws per chain, and a shift of 0.5.

iters = 250
chains = 4
shift = 0.5
r <- array(rnorm(iters * chains), c(iters, chains))
r[, 1] <- r[, 1] + shift
colnames(r) <- 1:4
mcmc_hist_r_scale(r)

Although Rhat was less than \(1.05\) in this situation, the rank plots clearly show that the first chain behaves differently.

4.2.5 Scaling one chain

Next, we investigate the sensitivity to detect if one of the chains has not converged to the same distribution as the others, but has lower marginal variance.

conds <- expand.grid(
  iters = c(250, 1000, 4000), 
  scale = c(0, 0.25, 0.5, 0.75, 1),
  rep = 1:10
)
res <- vector("list", nrow(conds))
chains = 4
for (i in 1:nrow(conds)) {
  iters <- conds[i, "iters"]
  scale <- conds[i, "scale"]
  rep <- conds[i, "rep"]
  r <- array(rnorm(iters * chains), c(iters, chains))
  r[, 1] <- r[, 1] * scale
  rs <- as.data.frame(monitor_extra(r))
  res[[i]] <- cbind(iters, scale, rep, rs)
}
res <- bind_rows(res)

We first look at the Rhat estimates:

ggplot(data = res, aes(y = zsRhat, x = scale)) + 
  geom_point() + 
  geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = 1.005, linetype = 'dashed') + 
  geom_hline(yintercept = 1) + 
  ggtitle('Split-Rhat')

Result: Split-Rhat is not able to detect scale differences between chains.

ggplot(data = res, aes(y = zfsRhat, x = scale)) + 
  geom_point() + 
  geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = 1.005, linetype = 'dashed') + 
  geom_hline(yintercept = 1) + 
  ggtitle('Folded-split-Rhat')

Result: Folded-Split-Rhat focuses on scales and detects scale differences.

Result: If we use a threshold of \(1.01\), we can detect a chain whose standard deviation is less than \(3/4\) of the standard deviation of the others. If we use a threshold of \(1.1\), we detect a chain whose standard deviation is less than \(1/4\) of the standard deviation of the others.

Next, we look at the effective sample size estimates:

ggplot(data = res, aes(y = zsreff, x = scale)) + 
  geom_point() + 
  geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = c(0,1)) + 
  geom_hline(aes(yintercept = 400 / (4 * iters)), linetype = 'dashed') + 
  ggtitle('Relative Bulk-ESS (zsreff)') + 
  scale_y_continuous(breaks = seq(0, 1.5, by = 0.25))

Result: The bulk effective sample size of the mean does not detect the problem, as it focuses on location differences between chains.

ggplot(data = res, aes(y = tailreff, x = scale)) + 
  geom_point() + 
  geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = 0) + 
  geom_hline(aes(yintercept = 400 / (4 * iters)), linetype = 'dashed') + 
  ggtitle('Relative tail-ESS (tailreff)') + 
  scale_y_continuous(breaks = seq(0, 1.5, by = 0.25))

Result: The tail effective sample size detects the difference in scales as it focuses on the scale of the chains.

Rank plots can be used to visualize differences between chains. Here, we show rank plots for the case of 4 chains, 250 draws per chain, and with one chain having a standard deviation of 0.75 as opposed to a standard deviation of 1 for the other chains.

iters = 250
chains = 4
scale = 0.75
r <- array(rnorm(iters * chains), c(iters, chains))
r[, 1] <- r[, 1] * scale
colnames(r) <- 1:4
mcmc_hist_r_scale(r)

Although the folded Rhat is \(1.06\), the rank plots clearly show that the first chain behaves differently.

Appendix E: Cauchy: A distribution with infinite mean and variance

The classic split-Rhat is based on calculating within- and between-chain variances. If the marginal distribution of a chain is such that the variance is not defined (i.e., infinite), the classic split-Rhat is not well justified. In this section, we will use the Cauchy distribution as an example of such a distribution. Even in cases where the mean and variance are finite, the distribution can be far from Gaussian; especially distributions with very long tails cause instability in variance and autocorrelation estimates. To alleviate these problems, we will use split-Rhat for rank-normalized draws.
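The instability is easy to demonstrate by simulation (a small illustration, not part of the case study code): the running sample variance of Cauchy draws never settles, because each new extreme draw can dominate the sum.

```r
# Running sample variance of Cauchy draws: with infinite true variance,
# the estimate keeps jumping whenever a new extreme value arrives.
set.seed(1)
x <- rcauchy(1e4)
n <- seq_along(x)
running_var <- cumsum(x^2) / n - (cumsum(x) / n)^2
# plot(n, running_var, type = "l") shows the characteristic jumps
```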

The following Cauchy models are from Michael Betancourt’s case study Fitting The Cauchy Distribution.

4.2.6 Nominal parameterization of Cauchy

We already looked at the nominal Cauchy model with direct parameterization in the main text, but for completeness, we take a closer look using different variants of the diagnostics.

writeLines(readLines("cauchy_nom.stan"))
parameters {
  vector[50] x;
}

model {
  x ~ cauchy(0, 1);
}

generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}

4.2.6.1 Default Stan options

Run the nominal model:

fit_nom <- stan(file = 'cauchy_nom.stan', seed = 7878, refresh = 0)
Warning: There were 1233 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: Examine the pairs() plot to diagnose sampling problems

Treedepth exceedance and the Bayesian fraction of missing information are diagnostics specific to dynamic HMC (Betancourt, 2017). We get warnings about a very large number of transitions after warmup that exceeded the maximum treedepth, which is likely due to the very long tails of the Cauchy distribution. All chains also have a low estimated Bayesian fraction of missing information, indicating slow mixing.

Trace plots for the first parameter look wild with occasional large values:

samp <- as.array(fit_nom) 
mcmc_trace(samp[, , 1])

Let’s check Rhat and ESS diagnostics.

res <- monitor_extra(samp[, , 1:50])
which_min_ess <- which.min(res$tailseff)
plot_rhat(res)

For one parameter, the Rhats exceed the classic threshold of \(1.1\). Depending on the Rhat variant, a few other parameters also exceed the threshold of \(1.01\). The rank normalized split-Rhat has several values over \(1.01\). Note that the classic split-Rhat is not well defined in this example, because the mean and variance of the Cauchy distribution are not finite.

plot_ess(res) 

Both classic and new effective sample size estimates have several very small values, and so the overall sample shouldn’t be trusted.

Result: Effective sample size is more sensitive than (rank-normalized) split-Rhat to detect problems of slow mixing.

We also check the log posterior value lp__ and find out that the effective sample size is worryingly low.

res <- monitor_extra(samp[, , 51:52]) 
cat('lp__: Bulk-ESS = ', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS =  117 
cat('lp__: Tail-ESS = ', round(res['lp__', 'tailseff'], 2), '\n')
lp__: Tail-ESS =  322.59 

We can further analyze potential problems using local effective sample size and rank plots. We examine x[36], which has the smallest tail-ESS of 117.

We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates (see Section Efficiency for small interval probability estimates). Each interval contains \(1/k\) of the draws (e.g., with \(k=20\)). The small interval efficiency measures mixing of an indicator function which indicates when the values are inside the specific small interval. This gives us a local efficiency measure which does not depend on the shape of the distribution.
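The computation can be sketched as follows; ess_fun is a hypothetical placeholder for an effective sample size routine such as the one in monitornew.R.

```r
# Sketch of the small-interval efficiency: assign each draw to one of k
# equal-probability bins and compute the ESS of each bin's indicator.
# ess_fun is a placeholder for an ESS routine (e.g., from monitornew.R).
local_ess <- function(theta, k = 20, ess_fun) {
  S <- length(theta)
  bin <- ceiling(k * rank(theta, ties.method = "first") / S)  # bins 1..k
  sapply(seq_len(k), function(i) ess_fun(as.numeric(bin == i)))
}
```

Each bin receives \(S/k\) draws by construction, so the resulting ESS values are directly comparable across bins.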

plot_local_ess(fit = fit_nom, par = which_min_ess, nalpha = 20)

We see that the efficiency is worryingly low in the tails (which is caused by slow mixing in long tails of Cauchy). Orange ticks show draws that exceeded the maximum treedepth.

An alternative way to examine the effective sample size in different parts of the posterior is to compute the effective sample size for quantiles (see Section Efficiency for quantiles). Each interval contains a specified proportion of the draws, and the efficiency measures the mixing of an indicator function indicating whether the values are inside the specific interval.
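A sketch of this quantile version, again with ess_fun as a hypothetical placeholder for an ESS routine:

```r
# Sketch of the quantile efficiency: the ESS of the indicator
# I(theta <= q_p) for each quantile of interest.
# ess_fun is a placeholder for an ESS routine (e.g., from monitornew.R).
quantile_ess <- function(theta, probs, ess_fun) {
  qs <- quantile(theta, probs)
  sapply(seq_along(probs), function(i) ess_fun(as.numeric(theta <= qs[i])))
}
```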

plot_quantile_ess(fit = fit_nom, par = which_min_ess, nalpha = 40)

We see that the efficiency is worryingly low in the tails (which is caused by slow mixing in long tails of Cauchy). Orange ticks show draws that exceeded the maximum treedepth.

We can further analyze potential problems using rank plots, from which we clearly see differences between chains.

xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])

4.2.6.2 Default Stan options + increased maximum treedepth

We can try to improve the performance by increasing max_treedepth to \(20\):

fit_nom_td20 <- stan(
  file = 'cauchy_nom.stan', seed = 7878, 
  refresh = 0, control = list(max_treedepth = 20)
)

Trace plots for the first parameter still look wild with occasional large values.

samp <- as.array(fit_nom_td20)
mcmc_trace(samp[, , 1])

res <- monitor_extra(samp[, , 1:50])
which_min_ess <- which.min(res$tailseff)

We check the diagnostics for all \(x\).

plot_rhat(res)

All Rhats are below \(1.1\), but many are over \(1.01\). Classic split-Rhat has more variation than the rank normalized Rhat (note that the former is not well defined). The folded rank normalized Rhat shows that there is still more variation in the scale than in the location between different chains.

plot_ess(res) 

Some classic effective sample sizes are very small. If we didn’t realize that the variance is infinite, we might try to run longer chains, but with an infinite variance, zero relative efficiency (ESS/S) is the truth and longer chains won’t help. However, other quantities can be well defined, which is why it is useful to also look at the rank normalized version as a generic transformation to achieve finite mean and variance. The smallest bulk-ESS is less than 1000, which is not that bad. The smallest median-ESS is larger than 2500, that is, we are able to estimate the median efficiently. However, many tail-ESS values are less than 400, indicating problems for estimating the scale of the posterior.

Result: The rank normalized effective sample size is more stable than classic effective sample size, which is not well defined for the Cauchy distribution.

Result: It is useful to look at both bulk- and tail-ESS.

We also check lp__. Although increasing max_treedepth improved the bulk-ESS of x, the efficiency for lp__ didn’t change.

res <- monitor_extra(samp[, , 51:52])
cat('lp__: Bulk-ESS =', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS = 240 
cat('lp__: Tail-ESS =', round(res['lp__', 'tailseff'], 2), '\n')
lp__: Tail-ESS = 586.95 

We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.

plot_local_ess(fit = fit_nom_td20, par = which_min_ess, nalpha = 20)

It seems that increasing max_treedepth did not improve the efficiency in the tails much. We also examine the effective sample size of different quantile estimates.

plot_quantile_ess(fit = fit_nom_td20, par = which_min_ess, nalpha = 40)

The rank plot visualisation of x[11], which has the smallest tail-ESS among the \(x\), indicates clear convergence problems.

xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])

The rank plot visualisation of lp__, which has an effective sample size 240, doesn’t look so good either.

mcmc_hist_r_scale(samp[, , "lp__"])

4.2.6.3 Default Stan options + increased maximum treedepth + longer chains

Let’s try running 8 times longer chains.

fit_nom_td20l <- stan(
  file = 'cauchy_nom.stan', seed = 7878, 
  refresh = 0, control = list(max_treedepth = 20), 
  warmup = 1000, iter = 9000
)
Warning: There were 7 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 20. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: Examine the pairs() plot to diagnose sampling problems

Trace plots for the first parameter still look wild with occasional large values.

samp <- as.array(fit_nom_td20l)
mcmc_trace(samp[, , 1])

res <- monitor_extra(samp[, , 1:50])
which_min_ess <- which.min(res$tailseff)

Let’s check the diagnostics for all \(x\).

plot_rhat(res)

All Rhats are below \(1.01\). The classic split-Rhat has more variation than the rank normalized Rhat (note that the former is not well defined in this case).

plot_ess(res) 

Most classic ESS values are close to zero. Running longer chains just made most of them even smaller.

The smallest bulk-ESS values are around 5000, which is not that bad. The smallest median-ESS values are larger than 25000, that is, we are able to estimate the median efficiently. However, the smallest tail-ESS is 919.37, indicating problems for estimating the scale of the posterior.

Result: The rank normalized effective sample size is more stable than classic effective sample size even for very long chains.

Result: It is useful to look at both bulk- and tail-ESS.

We also check lp__. Although increasing the number of iterations improved bulk-ESS of the \(x\), the relative efficiency for lp__ didn’t change.

res <- monitor_extra(samp[, , 51:52])
cat('lp__: Bulk-ESS =', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS = 1289 
cat('lp__: Tail-ESS =', round(res['lp__', 'tailseff'], 2), '\n')
lp__: Tail-ESS = 1886.54 

We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.

plot_local_ess(fit = fit_nom_td20l, par = which_min_ess, nalpha = 20)

Increasing the chain length did not seem to change the relative efficiency. With more draws from the longer chains we can use a finer resolution for the local efficiency estimates.

plot_local_ess(fit = fit_nom_td20l, par = which_min_ess, nalpha = 100)

The sampling efficiency far in the tails is worryingly low. This was more difficult to see previously with fewer draws from the tails. We see similar problems in the plot of effective sample size for quantiles.

plot_quantile_ess(fit = fit_nom_td20l, par = which_min_ess, nalpha = 100)

Let’s look at the rank plot visualisation of x[39], which has the smallest tail-ESS among the \(x\).

xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])

Increasing the number of iterations couldn’t remove the mixing problems in the tails. The mixing problem is inherent to the nominal parameterization of the Cauchy distribution.

4.2.7 First alternative parameterization of the Cauchy distribution

Next, we examine an alternative parameterization and consider the Cauchy distribution as a scale mixture of Gaussian distributions. The model has two parameters, and the Cauchy distributed \(x\) can be computed from those. In addition to improved sampling performance, the example illustrates that it matters which quantities the diagnostics focus on.

writeLines(readLines("cauchy_alt_1.stan"))
parameters {
  vector[50] x_a;
  vector<lower=0>[50] x_b;
}

transformed parameters {
  vector[50] x = x_a ./ sqrt(x_b);
}

model {
  x_a ~ normal(0, 1);
  x_b ~ gamma(0.5, 0.5);
}

generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}
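As a quick sanity check of the construction (an illustration, not part of the case study code): with \(x_a \sim \mathrm{Normal}(0,1)\) and \(x_b \sim \mathrm{Gamma}(0.5, 0.5) = \chi^2_1\), the ratio \(x_a/\sqrt{x_b}\) is Student-\(t\) with one degree of freedom, i.e., Cauchy(0, 1).

```r
# The scale mixture yields Cauchy(0, 1) draws: Normal / sqrt(chi^2_1) = t_1.
# Compare the sample quartiles with the theoretical values -1, 0, 1.
set.seed(1)
x <- rnorm(1e5) / sqrt(rgamma(1e5, shape = 0.5, rate = 0.5))
quantile(x, c(0.25, 0.5, 0.75))
```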

We run the alternative model:

fit_alt1 <- stan(file='cauchy_alt_1.stan', seed=7878, refresh = 0)

There are no warnings and the sampling is much faster.

samp <- as.array(fit_alt1)
res <- monitor_extra(samp[, , 101:150])
which_min_ess <- which.min(res$tailseff)
plot_rhat(res)

All Rhats are below \(1.01\). Classic split-Rhats also look good even though they are not well defined for the Cauchy distribution.

plot_ess(res) 

Result: The rank normalized ESS values have less variation than the classic ones, which are not well defined for the Cauchy distribution.

We check lp__:

res <- monitor_extra(samp[, , 151:152])
cat('lp__: Bulk-ESS =', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS = 1310 
cat('lp__: Tail-ESS =', round(res['lp__', 'tailseff'], 2), '\n')
lp__: Tail-ESS = 1927.69 

The relative efficiencies for lp__ are also much better than with the nominal parameterization.

We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.

plot_local_ess(fit = fit_alt1, par = 100 + which_min_ess, nalpha = 20)

The effective sample size is good in all parts of the posterior. We also examine the effective sample size of different quantile estimates.

plot_quantile_ess(fit = fit_alt1, par = 100 + which_min_ess, nalpha = 40)

We compare the mean effective sample sizes of the underlying parameters in the new parameterization and the actual \(x\) we are interested in.

res <- monitor_extra(samp[, , 101:150])
res1 <- monitor_extra(samp[, , 1:50])
res2 <- monitor_extra(samp[, , 51:100])
cat('Mean Bulk-ESS for x =' , round(mean(res[, 'zsseff']), 2), '\n')
Mean Bulk-ESS for x = 3629.24 
cat('Mean Tail-ESS for x =' , round(mean(res[, 'tailseff']), 2), '\n')
Mean Tail-ESS for x = 2265.23 
cat('Mean Bulk-ESS for x_a =' , round(mean(res1[, 'zsseff']), 2), '\n')
Mean Bulk-ESS for x_a = 3956.06 
cat('Mean Bulk-ESS for x_b =' , round(mean(res2[, 'zsseff']), 2), '\n')
Mean Bulk-ESS for x_b = 2761.22 

Result: We see that the effective sample size of the interesting \(x\) can be different from the effective sample size of the parameters \(x_a\) and \(x_b\) that we used to compute it.

The rank plot visualisation of x[40], which has the smallest tail-ESS of 1822.64 among the \(x\), looks better than for the nominal parameterization.

xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])

Similarly, the rank plot visualisation of lp__ looks better than for the nominal parameterization.

mcmc_hist_r_scale(samp[, , "lp__"])

4.2.8 Another alternative parameterization of the Cauchy distribution

Another alternative parameterization is obtained by a univariate transformation as shown in the following code (see also the 3rd alternative in Michael Betancourt’s case study).

writeLines(readLines("cauchy_alt_3.stan"))
parameters {
  vector<lower=0, upper=1>[50] x_tilde;
}

transformed parameters {
vector[50] x = tan(pi() * (x_tilde - 0.5));
}

model {
  // Implicit uniform prior on x_tilde
}

generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}
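The transformation is the quantile function (inverse CDF) of the Cauchy(0, 1) distribution, so the implicit uniform prior on x_tilde induces Cauchy(0, 1) draws for x (a quick illustration, not part of the case study code):

```r
# tan(pi * (u - 0.5)) is the Cauchy(0, 1) quantile function, so applying it
# to Uniform(0, 1) draws yields Cauchy(0, 1) draws.
set.seed(1)
u <- runif(1e5)
x <- tan(pi * (u - 0.5))
quantile(x, c(0.25, 0.5, 0.75))  # theoretical values: -1, 0, 1
```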

We run the alternative model:

fit_alt3 <- stan(file='cauchy_alt_3.stan', seed=7878, refresh = 0)

There are no warnings, and the sampling is much faster than for the nominal model.

samp <- as.array(fit_alt3)
res <- monitor_extra(samp[, , 51:100])
which_min_ess <- which.min(res$tailseff)
plot_rhat(res)

All Rhats except some folded Rhats are below \(1.01\). The classic split-Rhats also look good even though they are not well defined for the Cauchy distribution.

plot_ess(res) 

Result: The rank normalized relative efficiencies have less variation than the classic ones. The relative bulk-ESS and median-ESS are slightly larger than \(1\), which is possible for antithetic Markov chains, which have negative correlations at odd lags.
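To see how a relative efficiency above \(1\) can arise, consider the standard asymptotic formula for a stationary AR(1)-like chain with lag-1 correlation \(\rho\) (a textbook result, used here only as an illustration):

```r
# For a stationary AR(1) chain with lag-1 correlation rho, the asymptotic
# relative efficiency is ESS / S = (1 - rho) / (1 + rho), which exceeds 1
# when rho is negative (antithetic behavior).
rho <- -0.3                      # hypothetical negative autocorrelation
reff <- (1 - rho) / (1 + rho)    # about 1.86, i.e., ESS > S
```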

We also take a closer look at the lp__ value:

res <- monitor_extra(samp[, , 101:102])
cat('lp__: Bulk-ESS =', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS = 1494 
cat('lp__: Tail-ESS =', round(res['lp__', 'tailseff'], 2), '\n')
lp__: Tail-ESS = 1883.75 

The effective sample sizes for lp__ are also much better than with the nominal parameterization.

We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.

plot_local_ess(fit = fit_alt3, par = 50 + which_min_ess, nalpha = 20)

We examine also the sampling efficiency of different quantile estimates.

plot_quantile_ess(fit = fit_alt3, par = 50 + which_min_ess, nalpha = 40)

The effective sample size in tails is worse than for the first alternative parameterization, although it’s still better than for the nominal parameterization.

We compare the mean effective sample sizes of the underlying parameter of the new parameterization and the actual Cauchy-distributed \(x\) we are interested in.

res <- monitor_extra(samp[, , 51:100])
res1 <- monitor_extra(samp[, , 1:50])
cat('Mean bulk-seff for x =' , round(mean(res[, 'zsseff']), 2), '\n')
Mean bulk-seff for x = 4702.98 
cat('Mean tail-seff for x =' , round(mean(res[, 'zfsseff']), 2), '\n')
Mean tail-seff for x = 1602.7 
cat('Mean bulk-seff for x_tilde =' , round(mean(res1[, 'zsseff']), 2), '\n')
Mean bulk-seff for x_tilde = 4702.98 
cat('Mean tail-seff for x_tilde =' , round(mean(res1[, 'zfsseff']), 2), '\n')
Mean tail-seff for x_tilde = 1612.14 

The rank plot visualisation of x[5], which has the smallest tail-ESS of 1890.99 among the \(x\), shows good efficiency, similar to the results for lp__.

xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])

mcmc_hist_r_scale(samp[, , "lp__"])

4.2.9 Half-Cauchy distribution with nominal parameterization

Half-Cauchy priors are common and, for example, in Stan they are usually set using the nominal parameterization. However, when the constraint <lower=0> is used, Stan automatically samples in the unconstrained log(x) space, which substantially changes the geometry.

writeLines(readLines("half_cauchy_nom.stan"))
parameters {
  vector<lower=0>[50] x;
}

model {
  x ~ cauchy(0, 1);
}

generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}

We run the half-Cauchy model with nominal parameterization (and positive constraint).

fit_half_nom <- stan(file = 'half_cauchy_nom.stan', seed = 7878, refresh = 0)

There are no warnings and the sampling is much faster than for the full Cauchy distribution with nominal parameterization.

samp <- as.array(fit_half_nom)
res <- monitor_extra(samp[, , 1:50])
which_min_ess <- which.min(res$tailseff)
plot_rhat(res) 

All Rhats are below \(1.01\). Classic split-Rhats also look good even though they are not well defined for the half-Cauchy distribution.

plot_ess(res)  

Result: The rank normalized effective sample sizes have less variation than the classic ones. Some relative bulk-ESS and median-ESS values are larger than \(1\), which is possible for antithetic Markov chains, which have negative correlations at odd lags.

Due to the <lower=0> constraint, sampling was done in the log(x) space, and we can also check the performance in that space.

res <- monitor_extra(log(samp[, , 1:50]))
plot_ess(res) 

\(\log(x)\) is quite close to Gaussian, and thus the classic effective sample size is also close to the rank normalized ESS, which is exactly the same as for the original \(x\), as rank normalization is invariant to bijective transformations.

Result: The rank normalized effective sample size is close to the classic effective sample size for transformations which make the distribution close to Gaussian.
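The invariance underlying this result is easy to verify directly: ranks, and hence rank-normalized values, are unchanged by any strictly monotone transformation such as \(\log\) (a small illustration):

```r
# Ranks are invariant under strictly monotone transformations, so the rank
# normalized ESS of x and log(x) must agree for positive draws.
set.seed(1)
x <- abs(rcauchy(1000))          # positive, heavy-tailed draws (illustration)
stopifnot(identical(rank(x), rank(log(x))))
```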

We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.

plot_local_ess(fit = fit_half_nom, par = which_min_ess, nalpha = 20)

The effective sample size is good overall, with only a small dip in tails. We can also examine the effective sample size of different quantile estimates.

plot_quantile_ess(fit = fit_half_nom, par = which_min_ess, nalpha = 40)

The rank plot visualisation of x[32], which has the smallest tail-ESS of 1742 among \(x\), looks good.

xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])

The rank plot visualisation of lp__ reveals some small differences in the scales, but it’s difficult to know whether this small variation from uniform is relevant.

mcmc_hist_r_scale(samp[, , "lp__"])

4.2.10 Alternative parameterization of the half-Cauchy distribution

writeLines(readLines("half_cauchy_alt.stan"))
parameters {
  vector<lower=0>[50] x_a;
  vector<lower=0>[50] x_b;
}

transformed parameters {
  vector[50] x = x_a .* sqrt(x_b);
}

model {
  x_a ~ normal(0, 1);
  x_b ~ inv_gamma(0.5, 0.5);
}

generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}

We run the half-Cauchy model with the alternative parameterization:

fit_half_reparam <- stan(
  file = 'half_cauchy_alt.stan', seed = 7878, refresh = 0
)

There are no warnings and the sampling is as fast as for the half-Cauchy model with nominal parameterization.

samp <- as.array(fit_half_reparam)
res <- monitor_extra(samp[, , 101:150])
which_min_ess <- which.min(res$tailseff)
plot_rhat(res)

plot_ess(res) 

Result: The rank normalized relative efficiencies have less variation than the classic ones, which are not well defined for the Cauchy distribution. Based on bulk-ESS and median-ESS, the efficiency for central quantities is much lower, but based on tail-ESS and MAD-ESS, the efficiency in the tails is slightly better than for the half-Cauchy distribution with nominal parameterization. We also see that a parameterization which is good for the full Cauchy distribution is not necessarily good for the half-Cauchy distribution, as the <lower=0> constraint additionally changes the parameterization.

We also check the lp__ values:

res <- monitor_extra(samp[, , 151:152])
cat('lp__: Bulk-ESS =', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS = 977 
cat('lp__: Tail-ESS =', round(res['lp__', 'tailseff'], 2), '\n')
lp__: Tail-ESS = 1750.46 

We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.

plot_local_ess(fit_half_reparam, par = 100 + which_min_ess, nalpha = 20)

We also examine the effective sample size for different quantile estimates.

plot_quantile_ess(fit_half_reparam, par = 100 + which_min_ess, nalpha = 40)

The effective sample size near zero is much worse than for the half-Cauchy distribution with nominal parameterization.

The rank plot visualisation of x[20], which has the smallest tail-ESS of NaN among the \(x\), reveals deviations from uniformity, which is expected given the lower effective sample size.

xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])

A similar result is obtained when looking at the rank plots of lp__.

mcmc_hist_r_scale(samp[, , "lp__"])

4.2.11 The Cauchy distribution with Jags

So far, we have run all models in Stan, but we also want to investigate whether similar problems arise with probabilistic programming languages that use samplers other than variants of Hamiltonian Monte-Carlo. Thus, we also fit the Cauchy models with Jags, which uses a dialect of the BUGS language to specify models. Jags uses a clever mix of Gibbs and Metropolis-Hastings sampling. This kind of sampling does not scale well to high-dimensional posteriors of strongly interdependent parameters, but for the relatively simple models discussed in this case study it should work just fine.

The Jags code for the nominal parameterization of the Cauchy distribution looks as follows:

writeLines(readLines("cauchy_nom.bugs"))
model {
  for (i in 1:50) {
    x[i] ~ dt(0, 1, 1)
  }
}

First, we initialize the Jags model for later reuse.

jags_nom <- jags.model(
  "cauchy_nom.bugs",
  n.chains = 4, n.adapt = 10000
)
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 0
   Unobserved stochastic nodes: 50
   Total graph size: 52

Initializing model

Next, we sample 1000 iterations for each of the 4 chains for easy comparison with the corresponding Stan results.

samp_jags_nom <- coda.samples(
  jags_nom, variable.names = "x",
  n.iter = 1000
)
samp_jags_nom <- aperm(abind(samp_jags_nom, along = 3), c(1, 3, 2))
dimnames(samp_jags_nom)[[2]] <- paste0("chain:", 1:4)
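
The reshaping above converts the list of per-chain matrices returned by coda.samples into the iterations × chains × parameters array that monitor expects. A base-R sketch of the same step (using simulated stand-in matrices rather than the actual JAGS output, and no abind):

```r
# coda.samples() yields one iterations x parameters matrix per chain;
# monitor() wants a single iterations x chains x parameters array.
set.seed(1)
chains <- replicate(4, matrix(rnorm(1000 * 50), 1000, 50), simplify = FALSE)
samp <- array(NA_real_, dim = c(1000, 4, 50),
              dimnames = list(NULL, paste0("chain:", 1:4), paste0("x[", 1:50, "]")))
for (m in seq_along(chains)) samp[, m, ] <- chains[[m]]
dim(samp)  # 1000 4 50
```

The abind/aperm combination in the case study does the same thing in one step: abind stacks the matrices along a third dimension and aperm moves the chain index into the middle.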

We summarize the model as follows:

mon <- monitor(samp_jags_nom)
print(mon)
Inference for the input samples (4 chains: each with iter = 1000; warmup = 0):

          Q5    Q50   Q95    Mean       SD  Rhat Bulk_ESS Tail_ESS
x[1]  -6.065 -0.011 6.368  -3.288  164.714 1.001     4374     4001
x[2]  -6.634 -0.003 6.245  -0.644   32.410 1.000     4092     3971
x[3]  -6.639  0.019 6.721   3.589  229.228 1.000     4092     3763
x[4]  -6.355 -0.056 5.638  -0.510   33.838 1.000     4457     3891
x[5]  -5.489 -0.020 6.375   0.181   58.609 1.001     4124     3885
x[6]  -6.355 -0.006 6.357   0.767   44.227 1.000     3804     3843
x[7]  -7.672 -0.034 6.916   0.762   65.339 1.000     4007     4016
x[8]  -6.368  0.018 6.494  -0.479   27.321 1.001     4069     3890
x[9]  -6.054 -0.009 6.803   1.547   68.922 1.000     3864     3972
x[10] -6.864  0.016 6.747  52.663 3292.059 1.001     3771     3974
x[11] -6.784  0.033 5.830   0.039   35.321 1.001     4204     4102
x[12] -5.745  0.035 5.992  -1.009  168.276 1.001     4076     4012
x[13] -6.211  0.045 6.651  -1.789   63.149 1.000     3795     3837
x[14] -6.108  0.025 6.136   0.725   57.857 1.001     3969     3972
x[15] -5.626 -0.019 6.077   0.657  135.151 1.000     4124     4099
x[16] -7.235 -0.018 6.043  -4.887  188.630 1.000     3865     3665
x[17] -6.014  0.038 5.928  -0.772   61.074 1.001     3836     3613
x[18] -6.192  0.015 5.778  -0.289   32.625 1.000     3971     4025
x[19] -5.833  0.018 6.533  -0.416   34.581 1.000     3693     3663
x[20] -6.336 -0.036 6.053  -0.673   73.409 1.001     4171     4015
x[21] -6.181 -0.030 5.674  -0.697   24.795 1.001     4053     3917
x[22] -5.852 -0.029 5.425   8.866  588.074 1.001     3405     3737
x[23] -5.877  0.002 6.059  -6.604  366.713 1.000     3648     3931
x[24] -6.208  0.018 6.222   1.601  123.503 1.001     3753     4013
x[25] -6.454 -0.007 6.691   4.273  254.640 1.001     4021     3822
x[26] -6.104  0.019 6.641   0.485   46.759 1.001     4117     3803
x[27] -6.657 -0.021 5.997   4.694  189.151 1.000     3658     3834
x[28] -5.738 -0.019 5.979  -0.431   45.203 1.000     4062     3933
x[29] -6.622 -0.001 6.738   0.363   80.103 1.000     3886     4024
x[30] -6.590  0.040 5.683   0.052   27.396 1.000     3686     3271
x[31] -6.247  0.004 5.946  -7.028  490.306 1.001     3382     4001
x[32] -6.491 -0.022 5.783   1.860   58.397 1.001     4101     3919
x[33] -5.972 -0.003 6.440   2.577  106.670 1.000     4088     3889
x[34] -6.417  0.053 7.271  -2.861  127.422 1.000     3823     3932
x[35] -6.484  0.021 5.877   0.576   44.268 1.001     3933     3814
x[36] -6.703 -0.048 5.432  -0.940   50.770 1.000     3519     3826
x[37] -6.403 -0.040 6.134  -1.857   62.055 1.000     3759     3930
x[38] -7.000 -0.004 6.331   0.259  135.908 1.000     4003     3678
x[39] -6.076 -0.007 5.916  -1.360   48.272 0.999     4273     3812
x[40] -6.047  0.008 6.534  -0.570   35.955 1.001     3598     3914
x[41] -5.617  0.033 6.945  -0.508  188.046 1.000     3831     3808
x[42] -6.911 -0.007 6.039  -1.020   27.088 1.001     4323     4095
x[43] -6.754 -0.003 6.726   1.535  101.810 1.001     3890     3513
x[44] -6.265 -0.019 6.944   3.971  159.152 1.001     3765     3890
x[45] -6.004  0.016 6.369  53.339 3338.077 1.000     3746     3648
x[46] -5.547 -0.027 5.845 -17.738  871.215 0.999     4136     3928
x[47] -6.161  0.027 6.389   1.013  326.707 1.000     4005     3785
x[48] -5.728  0.030 7.437   0.731   29.429 1.001     4063     3874
x[49] -6.796 -0.026 5.979  -0.238   48.647 1.000     3847     3852
x[50] -6.013  0.030 6.624  -1.144   46.479 1.001     3671     3752

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (good values is 
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
which_min_ess <- which.min(mon[1:50, 'Bulk_ESS'])

The overall results look very promising, with Rhats close to 1 and ESS values close to the total number of draws of 4000. We take a detailed look at x[31], which has the smallest bulk-ESS of 3382.

We examine the sampling efficiency in different parts of the posterior by computing the efficiency estimates for small interval probability estimates.

plot_local_ess(fit = samp_jags_nom, par = which_min_ess, nalpha = 20)

The efficiency estimate is good in all parts of the posterior. Further, we examine the sampling efficiency of different quantile estimates.

plot_quantile_ess(fit = samp_jags_nom, par = which_min_ess, nalpha = 40)

Rank plots also look rather similar across chains.

xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp_jags_nom[, , xmin])

Result: Jags seems to be able to sample from the nominal parameterization of the Cauchy distribution just fine.

Appendix F: Hierarchical model: Eight Schools

We continue our discussion of hierarchical models on the Eight Schools data, which we started in Section Eight Schools. We also analyse the performance of different variants of the diagnostics.

4.2.12 A Centered Eight Schools model

writeLines(readLines("eight_schools_cp.stan"))
data {
  int<lower=0> J;
  real y[J];
  real<lower=0> sigma[J];
}

parameters {
  real mu;
  real<lower=0> tau;
  real theta[J];
}

model {
  mu ~ normal(0, 5);
  tau ~ cauchy(0, 5);
  theta ~ normal(mu, tau);
  y ~ normal(theta, sigma);
}

In the main text, we observed that the centered parameterization of this hierarchical model did not work well with Stan's default MCMC options even with an increased adapt_delta, so we directly try fitting the model with longer chains.

4.2.12.1 Centered parameterization with longer chains

Low efficiency can sometimes be compensated for with longer chains. Let's check chains that are 10 times longer.

fit_cp2 <- stan(
  file = 'eight_schools_cp.stan', data = eight_schools,
  iter = 20000, chains = 4, seed = 483892929, refresh = 0,
  control = list(adapt_delta = 0.95)
)
Warning: There were 2335 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: There were 1 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
monitor(fit_cp2)
Inference for the input samples (4 chains: each with iter = 20000; warmup = 10000):

              Q5     Q50    Q95    Mean    SD  Rhat Bulk_ESS Tail_ESS
mu        -0.994   4.836 10.339   4.875 3.566 1.048       71      189
tau        0.329   2.809 10.022   3.674 3.307 1.076       45       17
theta[1]  -1.362   6.432 16.320   6.760 5.639 1.013      407     9491
theta[2]  -2.478   5.472 12.555   5.416 4.804 1.024      153     9429
theta[3]  -4.923   4.664 11.525   4.405 5.434 1.030      117    10374
theta[4]  -2.886   5.337 12.396   5.233 4.971 1.026      140     9670
theta[5]  -4.484   4.317 10.691   4.091 4.937 1.039       89     4758
theta[6]  -4.148   4.697 11.274   4.490 5.079 1.028      118    11277
theta[7]  -0.883   6.596 15.527   6.829 5.108 1.011      449    11102
theta[8]  -3.465   5.411 13.337   5.337 5.493 1.022      172    10408
lp__     -24.943 -14.781  0.216 -13.837 7.586 1.068       50       86

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (good values is 
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
res <- monitor_extra(fit_cp2)
print(res)
Inference for the input samples (4 chains: each with iter = 20000; warmup = 10000):

            mean se_mean    sd      Q5     Q50    Q95 seff  reff sseff zseff zsseff zsreff  Rhat
mu         4.875   0.490 3.566  -0.994   4.836 10.339   53 0.001    71    54     71  0.002 1.050
tau        3.674   0.298 3.307   0.329   2.809 10.022  123 0.003   173    35     45  0.001 1.025
theta[1]   6.760   0.219 5.639  -1.362   6.432 16.320  666 0.017  1057   281    407  0.010 1.008
theta[2]   5.416   0.431 4.804  -2.478   5.472 12.555  124 0.003   169   113    153  0.004 1.022
theta[3]   4.405   0.530 5.434  -4.923   4.664 11.525  105 0.003   146    86    117  0.003 1.026
theta[4]   5.233   0.458 4.971  -2.886   5.337 12.396  118 0.003   163   105    140  0.004 1.024
theta[5]   4.091   0.566 4.937  -4.484   4.317 10.691   76 0.002   102    69     89  0.002 1.034
theta[6]   4.490   0.508 5.079  -4.148   4.697 11.274  100 0.002   137    87    118  0.003 1.025
theta[7]   6.829   0.226 5.108  -0.883   6.596 15.527  512 0.013   745   309    449  0.011 1.009
theta[8]   5.337   0.432 5.493  -3.465   5.411 13.337  162 0.004   231   125    172  0.004 1.018
lp__     -13.837   1.321 7.586 -24.943 -14.781  0.216   33 0.001    44    37     50  0.001 1.081
         sRhat zRhat zsRhat zfsRhat zfsseff zfsreff tailseff tailreff medsseff medsreff madsseff
mu       1.050 1.048  1.048   1.023     152   0.004      189    0.005      174    0.004      174
tau      1.023 1.077  1.076   1.008    1035   0.026       17    0.000      175    0.004      167
theta[1] 1.008 1.012  1.013   1.001    3297   0.082     9491    0.237      177    0.004      268
theta[2] 1.022 1.024  1.024   1.005    1817   0.045     9429    0.236      173    0.004      177
theta[3] 1.025 1.031  1.030   1.008     736   0.018    10374    0.259      168    0.004      172
theta[4] 1.023 1.026  1.026   1.005    1468   0.037     9670    0.242      167    0.004      167
theta[5] 1.035 1.037  1.039   1.012     375   0.009     4758    0.119      170    0.004      178
theta[6] 1.025 1.028  1.028   1.009     644   0.016    11277    0.282      176    0.004      176
theta[7] 1.008 1.011  1.011   1.000    2761   0.069    11102    0.278      179    0.004      852
theta[8] 1.017 1.022  1.022   1.003    2816   0.070    10408    0.260      166    0.004      191
lp__     1.080 1.071  1.068   1.059      55   0.001       86    0.002      170    0.004      157
         madsreff
mu          0.004
tau         0.004
theta[1]    0.007
theta[2]    0.004
theta[3]    0.004
theta[4]    0.004
theta[5]    0.004
theta[6]    0.004
theta[7]    0.021
theta[8]    0.005
lp__        0.004

We still get a large number of divergent transitions, so it is clear that the results cannot be trusted even if all other diagnostics were good. Still, it may be worth looking at additional diagnostics to better understand what is happening.

Some rank-normalized split-Rhats are still larger than \(1.01\). Bulk-ESS for tau and lp__ is around 50, which corresponds to a very low relative efficiency of about \(0.1\%\) and is far below our recommendation of ESS > 400. In cases like this, it is useful to look at the local efficiency estimates, too (and the large number of divergences is a clear indication of problems as well).

We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small intervals for tau.

plot_local_ess(fit = fit_cp2, par = "tau", nalpha = 50)

We see that the sampler has difficulties in exploring small tau values. As ESS < 400 for small-probability intervals at small tau values, we suspect that we may be missing a substantial amount of posterior mass and getting biased estimates.

We also examine the effective sample size of different quantile estimates.

plot_quantile_ess(fit = fit_cp2, par = "tau", nalpha = 100)

Several quantile estimates have ESS < 400, which raises the suspicion that there are convergence problems and that we may have significant bias.

Let's see how the Bulk-ESS and Tail-ESS change when we use more and more draws.

plot_change_ess(fit = fit_cp2, par = "tau")

We see that the given recommendations Bulk-ESS > 400 and Tail-ESS > 400 are not sufficient to detect the convergence problems in this case, even though the tail quantile estimates are able to detect these problems.

The rank plot visualisation of tau also shows clear sticking and mixing problems.

samp_cp2 <- as.array(fit_cp2)
mcmc_hist_r_scale(samp_cp2[, , "tau"])

Similar results are obtained for lp__, which is closely connected to tau for this model.

mcmc_hist_r_scale(samp_cp2[, , "lp__"])

We may also examine small interval efficiencies for mu.

plot_local_ess(fit = fit_cp2, par = "mu", nalpha = 50)

There are gaps of poor efficiency, which again indicate problems in the mixing of the chains. However, these problems do not occur for any specific range of values of mu, as was the case for tau. This tells us that it is probably not mu that the sampler has problems with, but more likely tau or a related quantity.

As we observed divergences, we should not trust any Monte Carlo standard error (MCSE) estimates, as they are likely biased as well. However, for illustration purposes, we compute the MCSE, tail quantiles, and corresponding effective sample sizes for the medians of mu and tau. Compared to the shorter MCMC run, using 10 times more draws has not reduced the MCSE to about one third, as would be expected without problems in the mixing of the chains.

round(quantile_mcse(samp_cp2[ , , "mu"], prob = 0.5), 2)
  mcse  Q05  Q95   Seff
1 0.37 4.22 5.43 173.52
round(quantile_mcse(samp_cp2[ , , "tau"], prob = 0.5), 2)
  mcse  Q05  Q95   Seff
1 0.27 2.38 3.27 174.86
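
The "one third" expectation comes from the usual Monte Carlo scaling: the MCSE behaves like \(\mathrm{sd}/\sqrt{S_{\rm eff}}\), so when mixing is healthy and \(S_{\rm eff}\) grows linearly with the number of draws, 10 times more draws should shrink the MCSE by a factor of \(\sqrt{10}\). A one-line arithmetic check:

```r
# With well-mixing chains, MCSE ~ sd / sqrt(S_eff); going from S_eff = 4000
# to S_eff = 40000 should therefore shrink the MCSE by this factor:
sqrt(40000 / 4000)  # sqrt(10), roughly 3.16, i.e. down to about one third
```

Here the observed Seff values stay near 175 instead of growing with the number of draws, which is why the MCSE barely improves.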

4.2.12.2 Centered parameterization with very long chains

For further evidence, let's check chains that are 100 times longer than the default. This is not something we would recommend in practice, as it does not solve the problems with divergences, as illustrated below.

fit_cp3 <- stan(
  file = 'eight_schools_cp.stan', data = eight_schools,
  iter = 200000, chains = 4, seed = 483892929, refresh = 0,
  control = list(adapt_delta = 0.95)
)
Warning: There were 11699 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: There were 3 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
monitor(fit_cp3)
Inference for the input samples (4 chains: each with iter = 2e+05; warmup = 1e+05):

              Q5     Q50    Q95    Mean    SD  Rhat Bulk_ESS Tail_ESS
mu        -1.097   4.371  9.829   4.371 3.326 1.001    18335    30265
tau        0.474   2.944 10.043   3.797 3.205 1.001     2200      769
theta[1]  -1.588   5.733 16.380   6.290 5.694 1.000    23832   110854
theta[2]  -2.528   4.850 12.763   4.944 4.756 1.000    27789   136002
theta[3]  -5.055   4.086 11.880   3.866 5.361 1.000    39355   122761
theta[4]  -2.950   4.684 12.644   4.749 4.862 1.000    32607   138545
theta[5]  -4.553   3.786 10.766   3.552 4.715 1.000    34479    44492
theta[6]  -4.164   4.156 11.618   4.011 4.907 1.000    37000    92227
theta[7]  -1.025   5.922 15.636   6.380 5.161 1.000    20685    58049
theta[8]  -3.490   4.740 13.507   4.854 5.393 1.000    36212   125498
lp__     -24.983 -15.148 -2.076 -14.584 6.871 1.001     2541     1074

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (good values is 
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
res <- monitor_extra(fit_cp3)
print(res)
Inference for the input samples (4 chains: each with iter = 2e+05; warmup = 1e+05):

            mean se_mean    sd      Q5     Q50    Q95  seff  reff sseff zseff zsseff zsreff  Rhat
mu         4.371   0.024 3.326  -1.097   4.371  9.829 18435 0.046 18411 18358  18335  0.046 1.000
tau        3.797   0.033 3.205   0.474   2.944 10.043  9355 0.023  9391  2206   2200  0.006 1.000
theta[1]   6.290   0.032 5.694  -1.588   5.733 16.380 30955 0.077 31120 23846  23832  0.060 1.000
theta[2]   4.944   0.026 4.756  -2.528   4.850 12.763 32922 0.082 33013 27715  27789  0.069 1.000
theta[3]   3.866   0.023 5.361  -5.055   4.086 11.880 53609 0.134 53632 39181  39355  0.098 1.000
theta[4]   4.749   0.024 4.862  -2.950   4.684 12.644 39472 0.099 39723 32535  32607  0.082 1.000
theta[5]   3.552   0.023 4.715  -4.553   3.786 10.766 41358 0.103 41819 34256  34479  0.086 1.000
theta[6]   4.011   0.023 4.907  -4.164   4.156 11.618 45877 0.115 46105 36785  37000  0.092 1.000
theta[7]   6.380   0.033 5.161  -1.025   5.922 15.636 24701 0.062 24691 20682  20685  0.052 1.000
theta[8]   4.854   0.024 5.393  -3.490   4.740 13.507 50166 0.125 50212 36407  36212  0.091 1.000
lp__     -14.584   0.140 6.871 -24.983 -15.148 -2.076  2414 0.006  2410  2545   2541  0.006 1.001
         sRhat zRhat zsRhat zfsRhat zfsseff zfsreff tailseff tailreff medsseff medsreff madsseff
mu       1.001 1.000  1.001   1.000   28848   0.072    30265    0.076    16309    0.041    18276
tau      1.000 1.001  1.001   1.000   32860   0.082      769    0.002    10904    0.027    15361
theta[1] 1.000 1.000  1.000   1.000   40586   0.101   110854    0.277    15333    0.038    18846
theta[2] 1.000 1.000  1.000   1.000   41788   0.104   136002    0.340    15135    0.038    18463
theta[3] 1.000 1.000  1.000   1.000   31386   0.078   122761    0.307    16680    0.042    22054
theta[4] 1.000 1.000  1.000   1.000   38705   0.097   138545    0.346    16525    0.041    19038
theta[5] 1.000 1.000  1.000   1.000   37235   0.093    44492    0.111    16593    0.041    18887
theta[6] 1.000 1.000  1.000   1.000   36568   0.091    92227    0.231    16393    0.041    18180
theta[7] 1.000 1.000  1.000   1.000   35129   0.088    58049    0.145    14827    0.037    17370
theta[8] 1.000 1.000  1.000   1.000   38924   0.097   125498    0.314    15431    0.039    16092
lp__     1.001 1.001  1.001   1.001    2942   0.007     1074    0.003    10102    0.025    13775
         madsreff
mu          0.046
tau         0.038
theta[1]    0.047
theta[2]    0.046
theta[3]    0.055
theta[4]    0.048
theta[5]    0.047
theta[6]    0.045
theta[7]    0.043
theta[8]    0.040
lp__        0.034

Rhat, Bulk-ESS and Tail-ESS are not able to detect the problems, although Tail-ESS for tau is suspiciously low compared to the total number of draws.

plot_local_ess(fit = fit_cp3, par = "tau", nalpha = 100)

plot_quantile_ess(fit = fit_cp3, par = "tau", nalpha = 100)

And the rank plots of tau also show sticking and mixing problems for small values of tau.

samp_cp3 <- as.array(fit_cp3)
mcmc_hist_r_scale(samp_cp3[, , "tau"])

What we do see is an advantage of rank plots over trace plots: even with 100,000 draws per chain, rank plots don't get crowded and the mixing problems of the chains are still easy to see.

With the centered parameterization, the mean estimate of tau tends to get smaller with more draws. With 400,000 draws using the centered parameterization, the mean estimate is 3.77 (se 0.03). With 40,000 draws using the non-centered parameterization, the mean estimate is 3.6 (se 0.02). The difference is more than 8 standard errors. We are able to see the convergence problems in the centered parameterization case if we look carefully (or use the divergence diagnostic), but Rhat, Bulk-ESS, Tail-ESS, and Monte Carlo error estimates for the mean cannot be trusted if other diagnostics indicate convergence problems!

4.2.12.3 Centered parameterization with very long chains and thinning

When the autocorrelation time is high, it has been common to thin the chains by saving only a small portion of the draws. However, this also throws away information that is useful for convergence diagnostics. With 400,000 iterations per chain, thinning by 200, and 4 chains, we again end up with 4000 draws in total, as with the default settings.
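
The arithmetic behind ending up with 4000 draws, assuming Stan's default of discarding the first half of the iterations as warmup:

```r
# Post-warmup draws per chain after thinning, times the number of chains.
iter <- 400000; thin <- 200; chains <- 4
post_warmup <- iter / 2          # Stan's default warmup is half the iterations
draws_total <- (post_warmup / thin) * chains
draws_total  # 4000, the same as with the default settings
```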

fit_cp4 <- stan(
  file = 'eight_schools_cp.stan', data = eight_schools,
  iter = 400000, thin = 200, chains = 4, seed = 483892929, refresh = 0,
  control = list(adapt_delta = 0.95)
)
Warning: There were 93 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: There were 3 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems

We observe several divergent transitions, and the estimated Bayesian fraction of missing information is also low, both of which indicate convergence problems and potentially biased estimates.

Unfortunately, the thinning causes the Rhat and ESS estimates to miss the problems. The posterior mean is still biased, being more than 3 standard errors away from the estimate obtained using the non-centered parameterization.

monitor(fit_cp4)
Inference for the input samples (4 chains: each with iter = 4e+05; warmup = 2e+05):

              Q5     Q50    Q95    Mean    SD  Rhat Bulk_ESS Tail_ESS
mu        -0.909   4.461  9.732   4.405 3.243 1.000     3784     3648
tau        0.462   2.890 10.031   3.750 3.160 1.001     3625     2447
theta[1]  -1.664   5.628 16.162   6.240 5.737 1.000     4101     3691
theta[2]  -2.166   4.844 12.614   5.042 4.619 1.000     3950     3946
theta[3]  -4.544   4.165 11.876   3.976 5.206 1.000     4121     3819
theta[4]  -3.017   4.727 12.425   4.747 4.834 1.000     4026     4188
theta[5]  -4.383   3.752 10.560   3.553 4.685 1.001     3790     3839
theta[6]  -3.762   4.296 11.814   4.182 4.859 1.000     4057     4059
theta[7]  -0.958   5.914 15.396   6.340 4.997 1.001     4154     3813
theta[8]  -3.545   4.642 13.475   4.783 5.328 1.000     4040     3968
lp__     -25.059 -15.017 -1.638 -14.424 6.988 1.002     3689     2616

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (good values is 
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
res <- monitor_extra(fit_cp4)
print(res)
Inference for the input samples (4 chains: each with iter = 4e+05; warmup = 2e+05):

            mean se_mean    sd      Q5     Q50    Q95 seff  reff sseff zseff zsseff zsreff  Rhat
mu         4.405   0.053 3.243  -0.909   4.461  9.732 3737 0.934  3779  3741   3784  0.946 1.000
tau        3.750   0.050 3.160   0.462   2.890 10.031 4006 1.002  4005  3625   3625  0.906 1.000
theta[1]   6.240   0.090 5.737  -1.664   5.628 16.162 4064 1.016  4071  4096   4101  1.025 1.000
theta[2]   5.042   0.074 4.619  -2.166   4.844 12.614 3924 0.981  3935  3939   3950  0.988 1.001
theta[3]   3.976   0.081 5.206  -4.544   4.165 11.876 4097 1.024  4104  4115   4121  1.030 1.000
theta[4]   4.747   0.077 4.834  -3.017   4.727 12.425 3961 0.990  4006  3971   4026  1.006 1.000
theta[5]   3.553   0.077 4.685  -4.383   3.752 10.560 3720 0.930  3810  3742   3790  0.948 1.000
theta[6]   4.182   0.077 4.859  -3.762   4.296 11.814 3945 0.986  4028  4010   4057  1.014 1.000
theta[7]   6.340   0.078 4.997  -0.958   5.914 15.396 4118 1.030  4127  4144   4154  1.038 1.000
theta[8]   4.783   0.085 5.328  -3.545   4.642 13.475 3968 0.992  3989  4014   4040  1.010 1.000
lp__     -14.424   0.118 6.988 -25.059 -15.017 -1.638 3505 0.876  3563  3686   3689  0.922 1.000
         sRhat zRhat zsRhat zfsRhat zfsseff zfsreff tailseff tailreff medsseff medsreff madsseff
mu       1.000 1.000  1.000   1.000    3655   0.914     3648    0.912     4193    1.048     3805
tau      1.001 1.000  1.001   1.000    4095   1.024     2447    0.612     3955    0.989     3815
theta[1] 1.000 1.000  1.000   0.999    4200   1.050     3691    0.923     4115    1.029     3902
theta[2] 1.000 1.001  1.000   0.999    4049   1.012     3946    0.987     3556    0.889     4059
theta[3] 1.000 1.000  1.000   1.000    3810   0.952     3819    0.955     4075    1.019     3810
theta[4] 1.000 1.000  1.000   1.000    3919   0.980     4188    1.047     3496    0.874     3885
theta[5] 1.000 1.000  1.000   1.001    3582   0.896     3839    0.960     3834    0.958     3842
theta[6] 0.999 1.000  0.999   1.000    3882   0.970     4059    1.015     4005    1.001     3820
theta[7] 1.001 1.000  1.001   1.000    4061   1.015     3813    0.953     4259    1.065     3871
theta[8] 1.000 1.000  1.000   0.999    3750   0.938     3968    0.992     4139    1.035     3826
lp__     1.001 1.000  1.001   1.002    3192   0.798     2616    0.654     3835    0.959     3914
         madsreff
mu          0.951
tau         0.954
theta[1]    0.976
theta[2]    1.015
theta[3]    0.952
theta[4]    0.971
theta[5]    0.960
theta[6]    0.955
theta[7]    0.968
theta[8]    0.956
lp__        0.978

Various diagnostic plots of tau look reasonable as well.

plot_local_ess(fit = fit_cp4, par = "tau", nalpha = 100)

plot_quantile_ess(fit = fit_cp4, par = "tau", nalpha = 100)

plot_change_ess(fit = fit_cp4, par = "tau")

However, the rank plots still seem to show the problem.

samp_cp4 <- as.array(fit_cp4)
mcmc_hist_r_scale(samp_cp4[, , "tau"])

4.2.13 Non-centered Eight Schools model

In the following, we want to expand our understanding of the non-centered parameterization of the hierarchical model fit to the eight schools data.

writeLines(readLines("eight_schools_ncp.stan"))
data {
  int<lower=0> J;
  real y[J];
  real<lower=0> sigma[J];
}

parameters {
  real mu;
  real<lower=0> tau;
  real theta_tilde[J];
}

transformed parameters {
  real theta[J];
  for (j in 1:J)
    theta[j] = mu + tau * theta_tilde[j];
}

model {
  mu ~ normal(0, 5);
  tau ~ cauchy(0, 5);
  theta_tilde ~ normal(0, 1);
  y ~ normal(theta, sigma);
}
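
The transformed parameters block encodes the non-centered identity: if theta_tilde ~ normal(0, 1), then theta = mu + tau * theta_tilde has exactly the normal(mu, tau) distribution that the centered model assigns to theta directly. A small plain-R illustration of this identity (with arbitrary example values for mu and tau):

```r
# Non-centered parameterization in miniature: shifting and scaling standard
# normal draws reproduces the normal(mu, tau) distribution.
set.seed(123)
mu <- 4; tau <- 3
theta_tilde <- rnorm(1e5)
theta <- mu + tau * theta_tilde
c(mean(theta), sd(theta))  # close to c(4, 3)
```

The benefit for sampling is that the prior on theta_tilde does not depend on tau, which removes the funnel-shaped dependence between theta and tau that troubles the centered parameterization.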

4.2.13.1 Non-centered parameterization with default MCMC options

In the main text, we have already seen that the non-centered parameterization works better than the centered parameterization, at least when we use an increased adapt_delta value. Let's see what happens when using the default MCMC options of Stan.

fit_ncp <- stan(
  file = 'eight_schools_ncp.stan', data = eight_schools,
  iter = 2000, chains = 4, seed = 483892929, refresh = 0
)
Warning: There were 2 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: Examine the pairs() plot to diagnose sampling problems

We observe a few divergent transitions with the default of adapt_delta=0.8. Let’s analyze the sample.

monitor(fit_ncp)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):

                    Q5    Q50    Q95   Mean    SD  Rhat Bulk_ESS Tail_ESS
mu              -0.980  4.412  9.515  4.376 3.240 1.001     4083     2378
tau              0.248  2.768  9.768  3.608 3.156 1.000     2303     1795
theta_tilde[1]  -1.310  0.352  1.883  0.322 0.971 1.002     4571     2604
theta_tilde[2]  -1.411  0.135  1.643  0.120 0.916 1.000     5771     3078
theta_tilde[3]  -1.620 -0.102  1.486 -0.089 0.957 1.002     4966     3054
theta_tilde[4]  -1.434  0.033  1.514  0.046 0.905 1.000     5442     2830
theta_tilde[5]  -1.668 -0.172  1.353 -0.155 0.913 1.001     4273     3005
theta_tilde[6]  -1.635 -0.084  1.478 -0.075 0.945 1.001     5192     2981
theta_tilde[7]  -1.248  0.385  1.883  0.362 0.972 1.000     3898     2800
theta_tilde[8]  -1.509  0.073  1.676  0.079 0.970 1.000     4848     2863
theta[1]        -1.379  5.678 15.842  6.270 5.601 1.000     3790     2549
theta[2]        -2.288  4.877 12.826  5.034 4.621 1.001     5002     2920
theta[3]        -4.276  4.079 11.901  3.948 5.239 1.000     4001     3036
theta[4]        -2.736  4.658 12.088  4.641 4.634 1.000     4699     3063
theta[5]        -4.135  3.894 10.453  3.630 4.537 1.000     4310     3184
theta[6]        -4.110  4.188 11.303  3.955 4.878 1.000     4965     2806
theta[7]        -0.845  5.857 15.176  6.284 4.943 1.001     4599     3296
theta[8]        -3.243  4.771 13.518  4.914 5.371 1.000     4461     3288
lp__           -11.062 -6.469 -3.684 -6.814 2.297 1.001     1711     2385

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (good values is 
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
res <- monitor_extra(fit_ncp)
print(res)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):

                 mean se_mean    sd      Q5    Q50    Q95 seff  reff sseff zseff zsseff zsreff
mu              4.376   0.051 3.240  -0.980  4.412  9.515 4023 1.006  4036  4070   4083  1.021
tau             3.608   0.061 3.156   0.248  2.768  9.768 2684 0.671  2700  2293   2303  0.576
theta_tilde[1]  0.322   0.014 0.971  -1.310  0.352  1.883 4528 1.132  4566  4532   4571  1.143
theta_tilde[2]  0.120   0.012 0.916  -1.411  0.135  1.643 5742 1.436  5758  5754   5771  1.443
theta_tilde[3] -0.089   0.014 0.957  -1.620 -0.102  1.486 4929 1.232  4974  4919   4966  1.242
theta_tilde[4]  0.046   0.012 0.905  -1.434  0.033  1.514 5370 1.342  5436  5376   5442  1.361
theta_tilde[5] -0.155   0.014 0.913  -1.668 -0.172  1.353 4222 1.056  4269  4226   4273  1.068
theta_tilde[6] -0.075   0.013 0.945  -1.635 -0.084  1.478 5182 1.296  5195  5179   5192  1.298
theta_tilde[7]  0.362   0.016 0.972  -1.248  0.385  1.883 3885 0.971  3888  3893   3898  0.974
theta_tilde[8]  0.079   0.014 0.970  -1.509  0.073  1.676 4838 1.210  4853  4834   4848  1.212
theta[1]        6.270   0.095 5.601  -1.379  5.678 15.842 3504 0.876  3530  3767   3790  0.948
theta[2]        5.034   0.066 4.621  -2.288  4.877 12.826 4892 1.223  4928  4965   5002  1.250
theta[3]        3.948   0.085 5.239  -4.276  4.079 11.901 3796 0.949  3825  3971   4001  1.000
theta[4]        4.641   0.069 4.634  -2.736  4.658 12.088 4554 1.139  4577  4676   4699  1.175
theta[5]        3.630   0.071 4.537  -4.135  3.894 10.453 4126 1.032  4174  4258   4310  1.077
theta[6]        3.955   0.071 4.878  -4.110  4.188 11.303 4726 1.182  4815  4922   4965  1.241
theta[7]        6.284   0.074 4.943  -0.845  5.857 15.176 4416 1.104  4423  4524   4599  1.150
theta[8]        4.914   0.084 5.371  -3.243  4.771 13.518 4046 1.012  4066  4439   4461  1.115
lp__           -6.814   0.056 2.297 -11.062 -6.469 -3.684 1678 0.420  1684  1704   1711  0.428
                Rhat sRhat zRhat zsRhat zfsRhat zfsseff zfsreff tailseff tailreff medsseff medsreff
mu             1.000 0.999 1.000  0.999   1.001    1775   0.444     2378    0.595     4237    1.059
tau            1.000 1.000 1.000  1.000   1.000    3132   0.783     1795    0.449     3151    0.788
theta_tilde[1] 1.000 1.000 1.000  1.000   1.002    2141   0.535     2604    0.651     4489    1.122
theta_tilde[2] 1.000 0.999 1.000  0.999   1.000    1984   0.496     3078    0.769     5350    1.337
theta_tilde[3] 1.000 1.000 1.000  1.000   1.002    2136   0.534     3054    0.763     5397    1.349
theta_tilde[4] 1.000 1.000 1.000  1.000   1.000    2008   0.502     2830    0.708     4821    1.205
theta_tilde[5] 1.000 1.000 1.000  1.000   1.001    2263   0.566     3005    0.751     4282    1.070
theta_tilde[6] 1.000 1.000 1.000  1.000   1.001    2078   0.520     2981    0.745     4760    1.190
theta_tilde[7] 1.000 1.000 1.000  1.000   1.000    2260   0.565     2800    0.700     3823    0.956
theta_tilde[8] 1.000 1.000 1.000  1.000   1.000    1905   0.476     2863    0.716     4351    1.088
theta[1]       1.001 1.001 1.000  1.000   1.000    2528   0.632     2549    0.637     4358    1.089
theta[2]       1.000 1.000 1.000  1.000   1.001    2302   0.576     2920    0.730     4229    1.057
theta[3]       1.000 1.000 1.000  1.000   1.000    2553   0.638     3036    0.759     4002    1.000
theta[4]       1.000 0.999 1.000  0.999   1.000    2654   0.664     3063    0.766     4548    1.137
theta[5]       1.000 1.000 1.000  1.000   1.000    2783   0.696     3184    0.796     4523    1.131
theta[6]       1.000 0.999 1.000  0.999   1.000    2666   0.666     2806    0.701     4759    1.190
theta[7]       1.000 1.001 1.000  1.000   1.001    2493   0.623     3296    0.824     4315    1.079
theta[8]       1.000 1.000 1.000  0.999   1.000    2472   0.618     3288    0.822     4400    1.100
lp__           1.001 1.000 1.001  1.001   1.000    2487   0.622     2385    0.596     1990    0.498
               madsseff madsreff
mu                 2343    0.586
tau                3171    0.793
theta_tilde[1]     2491    0.623
theta_tilde[2]     2406    0.602
theta_tilde[3]     2447    0.612
theta_tilde[4]     2419    0.605
theta_tilde[5]     2692    0.673
theta_tilde[6]     2412    0.603
theta_tilde[7]     2783    0.696
theta_tilde[8]     2000    0.500
theta[1]           2788    0.697
theta[2]           2542    0.636
theta[3]           2971    0.743
theta[4]           2859    0.715
theta[5]           2689    0.672
theta[6]           3152    0.788
theta[7]           2911    0.728
theta[8]           2644    0.661
lp__               2796    0.699

All Rhat values are close to 1, and the ESS values are good despite a few divergent transitions. Small-interval and quantile plots of tau reveal some sampling problems for small values of tau, but they are not nearly as severe as for the centered parameterization.

plot_local_ess(fit = fit_ncp, par = "tau", nalpha = 20)

plot_quantile_ess(fit = fit_ncp, par = "tau", nalpha = 40)

Overall, the non-centered parameterization looks good even with the default setting of adapt_delta, and increasing it to 0.95 gets rid of the last remaining problems. This stands in sharp contrast to what we observed for the centered parameterization, where increasing adapt_delta didn’t help at all. This is something we observe quite often: a suboptimal parameterization can cause problems that are not simply solved by tuning the sampler. Instead, we have to adjust the model to achieve trustworthy inference.

4.2.14 Eight Schools with Jags

We will also run the centered and non-centered parameterizations of the eight schools model with Jags.

4.2.14.1 Centered Eight Schools Model

The Jags code for the centered eight schools model looks as follows:

writeLines(readLines("eight_schools_cp.bugs"))
model {
  for (j in 1:J) {
    sigma_prec[j] <- pow(sigma[j], -2)
    theta[j] ~ dnorm(mu, tau_prec)
    y[j] ~ dnorm(theta[j], sigma_prec[j])
  }
  mu ~ dnorm(0, pow(5, -2))
  tau ~ dt(0, pow(5, -2), 1)T(0, )
  tau_prec <- pow(tau, -2)
}

First, we initialize the Jags model so that we can reuse it later.

jags_cp <- jags.model(
  "eight_schools_cp.bugs",
  data = eight_schools,
  n.chains = 4, n.adapt = 10000
)
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 8
   Unobserved stochastic nodes: 10
   Total graph size: 40

Initializing model

Next, we sample 1000 iterations for each of the 4 chains for easy comparison with the corresponding Stan results.

samp_jags_cp <- coda.samples(
  jags_cp, c("theta", "mu", "tau"),
  n.iter = 1000
)
samp_jags_cp <- aperm(abind(samp_jags_cp, along = 3), c(1, 3, 2))
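The conversion in the last line deserves a brief note: coda.samples returns an mcmc.list with one iterations × parameters matrix per chain, while monitor expects an iterations × chains × parameters array. The following self-contained sketch (synthetic draws rather than the Jags output, and base R's simplify2array standing in for abind) illustrates the same rearrangement:

```r
# Synthetic example of the mcmc.list -> array conversion.
# coda.samples returns one iterations x parameters matrix per chain.
chains <- replicate(
  4,
  matrix(rnorm(1000 * 2), 1000, 2, dimnames = list(NULL, c("mu", "tau"))),
  simplify = FALSE
)
# Stack the chains along a third dimension (iter x par x chain), then
# reorder to iter x chain x par, matching aperm(abind(...), c(1, 3, 2)).
arr <- aperm(simplify2array(chains), c(1, 3, 2))
dim(arr)  # 1000 4 2
```

The c(1, 3, 2) permutation simply moves the chain dimension into the middle, which is all the aperm call above does.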

Convergence diagnostics indicate problems in the sampling of mu and tau, and, to a lesser degree, in all other parameters.

mon <- monitor(samp_jags_cp)
print(mon)
Inference for the input samples (4 chains: each with iter = 1000; warmup = 0):

             Q5   Q50    Q95  Mean    SD  Rhat Bulk_ESS Tail_ESS
mu       -0.671 4.522  9.701 4.590 3.134 1.028      151      315
tau       0.376 2.754  9.189 3.457 2.846 1.054       79      153
theta[1] -1.320 5.575 15.333 6.237 5.237 1.020      406      819
theta[2] -1.855 4.840 12.649 5.142 4.512 1.024      372     1077
theta[3] -4.399 4.365 11.883 4.232 4.975 1.029      267      963
theta[4] -2.646 4.796 12.578 4.961 4.613 1.022      380     1189
theta[5] -4.185 4.101 10.479 3.744 4.473 1.027      193      781
theta[6] -3.587 4.407 11.162 4.275 4.558 1.020      225      822
theta[7] -0.749 5.659 14.801 6.361 4.856 1.018      372      666
theta[8] -2.825 4.819 13.094 5.006 4.945 1.028      441     1073

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (good values is 
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).

We also see problems in the sampling of tau using various diagnostic plots.

plot_local_ess(samp_jags_cp, par = "tau", nalpha = 20)

plot_quantile_ess(samp_jags_cp, par = "tau", nalpha = 20)

plot_change_ess(samp_jags_cp, par = "tau")

Let’s see what happens if we run the chains 10 times longer.

samp_jags_cp <- coda.samples(
  jags_cp, c("theta", "mu", "tau"),
  n.iter = 10000
)
samp_jags_cp <- aperm(abind(samp_jags_cp, along = 3), c(1, 3, 2))

Convergence looks better now, although tau is still not estimated very efficiently.

mon <- monitor(samp_jags_cp)
print(mon)
Inference for the input samples (4 chains: each with iter = 10000; warmup = 0):

             Q5   Q50    Q95  Mean    SD  Rhat Bulk_ESS Tail_ESS
mu       -1.402 4.211  9.782 4.244 3.377 1.004     1175     1187
tau       0.244 2.854  9.899 3.673 3.204 1.005      653      700
theta[1] -1.880 5.552 16.111 6.083 5.665 1.002     1826     1622
theta[2] -2.750 4.668 12.567 4.797 4.738 1.002     2145     2967
theta[3] -4.968 3.956 11.759 3.749 5.328 1.002     2229     7328
theta[4] -3.151 4.486 12.481 4.589 4.864 1.003     2107     4318
theta[5] -4.431 3.665 10.683 3.474 4.723 1.002     1956     4772
theta[6] -4.076 3.987 11.423 3.877 4.885 1.002     2199     7352
theta[7] -1.355 5.730 15.548 6.229 5.201 1.002     1613     1288
theta[8] -3.516 4.614 13.379 4.747 5.385 1.002     2316     7483

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (good values is 
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).

The diagnostic plots of quantiles and small intervals tell a similar story.

plot_local_ess(samp_jags_cp, par = "tau", nalpha = 20)

plot_quantile_ess(samp_jags_cp, par = "tau", nalpha = 20)

Notably, however, the effective sample size of tau increases linearly in the total number of draws, indicating that convergence for tau may be achieved by simply running longer chains.

plot_change_ess(samp_jags_cp, par = "tau")
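The linear-growth argument can also be checked in isolation with a small synthetic experiment (our own sketch, not part of the original analysis: an AR(1) process stands in for a slowly mixing but geometrically ergodic chain, and a crude AR(1)-based ESS estimate stands in for monitor):

```r
# For a geometrically ergodic chain, ESS grows roughly linearly in the
# number of draws. We mimic slow mixing with a strongly autocorrelated
# AR(1) process.
set.seed(1)
x <- as.numeric(arima.sim(list(ar = 0.95), n = 40000))

# Crude ESS estimate from the empirical lag-1 autocorrelation,
# using the AR(1) identity ESS/N = (1 - rho) / (1 + rho).
ess_ar1 <- function(x) {
  rho <- acf(x, lag.max = 1, plot = FALSE)$acf[2]
  length(x) * (1 - rho) / (1 + rho)
}

ess_by_n <- sapply(c(10000, 20000, 40000), function(n) ess_ar1(x[1:n]))
ess_by_n  # roughly doubles each time the number of draws doubles
```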

Result: Similar to Stan, Jags also has convergence problems with the centered parameterization of the eight schools model.

4.2.14.2 Non-Centered Eight Schools Model

The Jags code for the non-centered eight schools model looks as follows:

writeLines(readLines("eight_schools_ncp.bugs"))
model {
  for (j in 1:J) {
    sigma_prec[j] <- pow(sigma[j], -2)
    theta_tilde[j] ~ dnorm(0, 1)
    theta[j] = mu + tau * theta_tilde[j]
    y[j] ~ dnorm(theta[j], sigma_prec[j])
  }
  mu ~ dnorm(0, pow(5, -2))
  tau ~ dt(0, pow(5, -2), 1)T(0, )
}

First, we initialize the Jags model so that we can reuse it later.

jags_ncp <- jags.model(
  "eight_schools_ncp.bugs",
  data = eight_schools,
  n.chains = 4, n.adapt = 10000
)
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 8
   Unobserved stochastic nodes: 10
   Total graph size: 55

Initializing model

Next, we sample 1000 iterations for each of the 4 chains for easy comparison with the corresponding Stan results.

samp_jags_ncp <- coda.samples(
  jags_ncp, c("theta", "mu", "tau"),
  n.iter = 1000
)
samp_jags_ncp <- aperm(abind(samp_jags_ncp, along = 3), c(1, 3, 2))

Convergence diagnostics indicate much better mixing than for the centered eight schools model.

mon <- monitor(samp_jags_ncp)
print(mon)
Inference for the input samples (4 chains: each with iter = 1000; warmup = 0):

             Q5   Q50    Q95  Mean    SD  Rhat Bulk_ESS Tail_ESS
mu       -1.030 4.336  9.801 4.373 3.336 1.000     3035     3356
tau       0.276 2.871 10.338 3.837 3.528 1.002     1107      969
theta[1] -1.665 5.839 16.769 6.450 5.823 1.001     2997     2094
theta[2] -2.466 5.009 12.703 5.013 4.737 1.000     3593     3485
theta[3] -5.402 4.120 11.958 3.788 5.513 1.001     3465     2685
theta[4] -3.125 4.728 12.598 4.755 4.919 1.000     4097     3615
theta[5] -4.789 3.894 11.052 3.607 4.945 1.000     3684     3324
theta[6] -4.137 4.208 11.482 4.050 4.851 1.000     3879     3076
theta[7] -1.118 5.962 15.601 6.412 5.217 1.001     2728     2724
theta[8] -3.796 4.801 13.514 4.844 5.390 1.000     4150     3442

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (good values is 
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).

Specifically, the mixing of tau looks much better, although we still see some problems in the estimation of the larger quantiles.

plot_local_ess(samp_jags_ncp, par = "tau", nalpha = 20)

plot_quantile_ess(samp_jags_ncp, par = "tau", nalpha = 20)

The change in effective sample size is roughly linear, indicating that the remaining convergence problems are likely to be solved by running longer chains.

plot_change_ess(samp_jags_ncp, par = "tau")

Result: Similar to Stan, Jags can sample from the non-centered parameterization of the eight schools model much better than from the centered parameterization.

Appendix G: Dynamic HMC and effective sample size

We have already seen that dynamic HMC can yield an effective sample size that is higher than the actual number of draws, that is, better than independent draws. The next example illustrates interesting relative-efficiency phenomena due to the properties of dynamic HMC algorithms.
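Before turning to the example, here is a minimal toy illustration of such super-efficiency (our own sketch, not part of the original analysis): a chain with negative lag-1 autocorrelation, mimicking the antithetic behavior that dynamic HMC can produce, has an ESS above the actual number of draws.

```r
# Toy antithetic chain: AR(1) with a negative coefficient. For AR(1),
# ESS/N = (1 - rho) / (1 + rho), which exceeds 1 when rho < 0.
set.seed(1)
N <- 100000
x <- as.numeric(arima.sim(list(ar = -0.3), n = N))

rho <- acf(x, lag.max = 1, plot = FALSE)$acf[2]  # close to -0.3
ess <- N * (1 - rho) / (1 + rho)
ess / N  # approximately 1.86: more effective draws than actual draws
```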

We sample from a simple 16-dimensional standard normal model.

writeLines(readLines("normal.stan"))
data {
  int<lower=1> J;
}
parameters {
  vector[J] x;
}
model {
  x ~ normal(0, 1);
}
fit_n <- stan(
  file = 'normal.stan', data = data.frame(J = 16),
  iter = 20000, chains = 4, seed = 483892929, refresh = 0 
)
samp <- as.array(fit_n)
monitor(samp)
Inference for the input samples (4 chains: each with iter = 10000; warmup = 0):

           Q5    Q50    Q95   Mean    SD  Rhat Bulk_ESS Tail_ESS
x[1]   -1.657 -0.002  1.647 -0.002 1.003     1    98264    28709
x[2]   -1.641 -0.006  1.644 -0.003 0.996     1    95812    29664
x[3]   -1.634  0.002  1.620  0.003 0.990     1    98640    28669
x[4]   -1.648  0.005  1.663  0.006 1.007     1    97302    29166
x[5]   -1.640 -0.001  1.634  0.002 0.999     1   101542    29930
x[6]   -1.647 -0.001  1.646 -0.002 1.001     1    96292    28376
x[7]   -1.628  0.008  1.629  0.003 0.993     1    96016    29238
x[8]   -1.646 -0.007  1.650 -0.005 1.002     1   100375    29893
x[9]   -1.640  0.006  1.655  0.004 0.998     1   101141    28621
x[10]  -1.624 -0.007  1.630 -0.002 0.990     1   103126    29411
x[11]  -1.645  0.006  1.661  0.004 1.000     1    95886    28488
x[12]  -1.619  0.005  1.626  0.006 0.993     1    98433    29228
x[13]  -1.621  0.010  1.650  0.003 0.990     1    98181    27421
x[14]  -1.629 -0.002  1.625 -0.003 0.989     1    97313    27507
x[15]  -1.625  0.005  1.643  0.006 0.992     1    95223    29139
x[16]  -1.660  0.003  1.645 -0.001 1.006     1    99980    29639
lp__  -13.000 -7.660 -3.923 -7.951 2.790     1    14489    19627

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (good values is 
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
res <- monitor_extra(samp)
print(res)
Inference for the input samples (4 chains: each with iter = 10000; warmup = 0):

        mean se_mean    sd      Q5    Q50    Q95   seff  reff  sseff  zseff zsseff zsreff  Rhat
x[1]  -0.002   0.003 1.003  -1.657 -0.002  1.647  97846 2.446  98426  97687  98264  2.457     1
x[2]  -0.003   0.003 0.996  -1.641 -0.006  1.644  95437 2.386  95717  95531  95812  2.395     1
x[3]   0.003   0.003 0.990  -1.634  0.002  1.620  98375 2.459  98703  98313  98640  2.466     1
x[4]   0.006   0.003 1.007  -1.648  0.005  1.663  96642 2.416  97290  96654  97302  2.433     1
x[5]   0.002   0.003 0.999  -1.640 -0.001  1.634 101317 2.533 101514 101344 101542  2.539     1
x[6]  -0.002   0.003 1.001  -1.647 -0.001  1.646  96011 2.400  96299  96003  96292  2.407     1
x[7]   0.003   0.003 0.993  -1.628  0.008  1.629  95558 2.389  96082  95492  96016  2.400     1
x[8]  -0.005   0.003 1.002  -1.646 -0.007  1.650  99937 2.498 100392  99920 100375  2.509     1
x[9]   0.004   0.003 0.998  -1.640  0.006  1.655 100824 2.521 101134 100831 101141  2.529     1
x[10] -0.002   0.003 0.990  -1.624 -0.007  1.630 102178 2.554 102983 102317 103126  2.578     1
x[11]  0.004   0.003 1.000  -1.645  0.006  1.661  95226 2.381  95863  95250  95886  2.397     1
x[12]  0.006   0.003 0.993  -1.619  0.005  1.626  97828 2.446  98357  97903  98433  2.461     1
x[13]  0.003   0.003 0.990  -1.621  0.010  1.650  97666 2.442  98166  97682  98181  2.455     1
x[14] -0.003   0.003 0.989  -1.629 -0.002  1.625  96805 2.420  97327  96792  97313  2.433     1
x[15]  0.006   0.003 0.992  -1.625  0.005  1.643  94911 2.373  95185  94950  95223  2.381     1
x[16] -0.001   0.003 1.006  -1.660  0.003  1.645  99541 2.489 100111  99413  99980  2.499     1
lp__  -7.951   0.023 2.790 -13.000 -7.660 -3.923  14922 0.373  14934  14480  14489  0.362     1
      sRhat zRhat zsRhat zfsRhat zfsseff zfsreff tailseff tailreff medsseff medsreff madsseff
x[1]      1     1      1       1   16450   0.411    28709    0.718    82432    2.061    19205
x[2]      1     1      1       1   16462   0.412    29664    0.742    75494    1.887    19208
x[3]      1     1      1       1   16152   0.404    28669    0.717    78630    1.966    18732
x[4]      1     1      1       1   16075   0.402    29166    0.729    81148    2.029    19079
x[5]      1     1      1       1   16785   0.420    29930    0.748    79953    1.999    20116
x[6]      1     1      1       1   16578   0.414    28376    0.709    79018    1.975    19626
x[7]      1     1      1       1   17109   0.428    29238    0.731    81690    2.042    19543
x[8]      1     1      1       1   16400   0.410    29893    0.747    79263    1.982    18828
x[9]      1     1      1       1   15890   0.397    28621    0.716    81119    2.028    18683
x[10]     1     1      1       1   16460   0.412    29411    0.735    76948    1.924    19242
x[11]     1     1      1       1   15969   0.399    28488    0.712    79164    1.979    18329
x[12]     1     1      1       1   15294   0.382    29228    0.731    81841    2.046    18720
x[13]     1     1      1       1   15370   0.384    27421    0.686    80615    2.015    18210
x[14]     1     1      1       1   16452   0.411    27507    0.688    77592    1.940    19050
x[15]     1     1      1       1   16651   0.416    29139    0.728    80406    2.010    19622
x[16]     1     1      1       1   16395   0.410    29639    0.741    82347    2.059    18845
lp__      1     1      1       1   21484   0.537    19627    0.491    17074    0.427    23603
      madsreff
x[1]     0.480
x[2]     0.480
x[3]     0.468
x[4]     0.477
x[5]     0.503
x[6]     0.491
x[7]     0.489
x[8]     0.471
x[9]     0.467
x[10]    0.481
x[11]    0.458
x[12]    0.468
x[13]    0.455
x[14]    0.476
x[15]    0.491
x[16]    0.471
lp__     0.590

The bulk-ESS for all \(x_j\) is larger than \(9.5223 \times 10^{4}\). However, the tail-ESS for all \(x_j\) is less than \(2.9930 \times 10^{4}\). Further, the bulk-ESS for lp__ is only \(1.4489 \times 10^{4}\).
If we take a look at all the Stan examples in this notebook, we see that the relative bulk efficiency of lp__ (bulk-ESS divided by the total number of draws \(S\)) is always below 0.5. This is because lp__ correlates strongly with the total energy in HMC, which is sampled using a random walk proposal once per iteration. Thus, it is likely that lp__ exhibits some random walk behavior as well, leading to autocorrelation and a small relative efficiency. At the same time, dynamic HMC can create antithetic Markov chains, which have negative autocorrelations at odd lags. This results in a bulk-ESS greater than \(S\) for some parameters.

Let’s check the effective sample size in different parts of the posterior by computing the effective sample size for small interval estimates for x[1].

plot_local_ess(fit_n, par = 1, nalpha = 100)

The relative efficiency of the probability estimates for small intervals is close to 1, with a slight drop in the tails. This is a good result, but far below the relative efficiency of the bulk, mean, and median estimates. Let’s check the effective sample size for quantiles.

plot_quantile_ess(fit = fit_n, par = 1, nalpha = 100)

Central quantile estimates have higher effective sample size than tail quantile estimates.

The total energy of HMC determines how far into the tails a chain can move within one iteration. Regions far in the tails of the target have high potential energy, and thus only iterations with high total energy can reach them. This suggests that the random walk in the total energy induces random walk behavior in the variance of \(x\). Let’s check the second moment of \(x\).

samp_x2 <- as.array(fit_n, pars = "x")^2
monitor(samp_x2)
Inference for the input samples (4 chains: each with iter = 10000; warmup = 0):

         Q5   Q50   Q95  Mean    SD  Rhat Bulk_ESS Tail_ESS
x[1]  0.004 0.458 3.849 1.006 1.438     1    16443    18225
x[2]  0.004 0.444 3.797 0.992 1.416     1    16492    19392
x[3]  0.004 0.446 3.799 0.980 1.385     1    16148    18342
x[4]  0.004 0.453 3.947 1.015 1.462     1    16070    18288
x[5]  0.004 0.453 3.865 0.998 1.415     1    16785    18672
x[6]  0.004 0.452 3.907 1.001 1.424     1    16572    17525
x[7]  0.004 0.452 3.742 0.987 1.388     1    17097    19120
x[8]  0.004 0.465 3.800 1.004 1.417     1    16397    18152
x[9]  0.004 0.448 3.806 0.995 1.408     1    15922    18049
x[10] 0.004 0.444 3.727 0.981 1.389     1    16461    18098
x[11] 0.004 0.458 3.854 1.001 1.410     1    16008    19463
x[12] 0.004 0.449 3.755 0.987 1.407     1    15368    17674
x[13] 0.004 0.444 3.834 0.981 1.381     1    15371    16755
x[14] 0.004 0.447 3.750 0.979 1.375     1    16461    17715
x[15] 0.004 0.451 3.766 0.983 1.384     1    16655    19241
x[16] 0.004 0.468 3.862 1.012 1.411     1    16400    19741

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (good values is 
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
res <- monitor_extra(samp_x2)
print(res)
Inference for the input samples (4 chains: each with iter = 10000; warmup = 0):

       mean se_mean    sd    Q5   Q50   Q95  seff  reff sseff zseff zsseff zsreff  Rhat sRhat zRhat
x[1]  1.006   0.012 1.438 0.004 0.458 3.849 14657 0.366 14664 16440  16443  0.411     1     1     1
x[2]  0.992   0.011 1.416 0.004 0.444 3.797 15511 0.388 15518 16488  16492  0.412     1     1     1
x[3]  0.980   0.011 1.385 0.004 0.446 3.799 14684 0.367 14703 16133  16148  0.404     1     1     1
x[4]  1.015   0.012 1.462 0.004 0.453 3.947 14510 0.363 14518 16039  16070  0.402     1     1     1
x[5]  0.998   0.012 1.415 0.004 0.453 3.865 15017 0.375 15029 16766  16785  0.420     1     1     1
x[6]  1.001   0.012 1.424 0.004 0.452 3.907 14368 0.359 14380 16547  16572  0.414     1     1     1
x[7]  0.987   0.011 1.388 0.004 0.452 3.742 15464 0.387 15488 17080  17097  0.427     1     1     1
x[8]  1.004   0.012 1.417 0.004 0.465 3.800 14773 0.369 14766 16397  16397  0.410     1     1     1
x[9]  0.995   0.012 1.408 0.004 0.448 3.806 13439 0.336 13467 15910  15922  0.398     1     1     1
x[10] 0.981   0.011 1.389 0.004 0.444 3.727 14654 0.366 14672 16430  16461  0.412     1     1     1
x[11] 1.001   0.011 1.410 0.004 0.458 3.854 15068 0.377 15098 15997  16008  0.400     1     1     1
x[12] 0.987   0.012 1.407 0.004 0.449 3.755 14215 0.355 14221 15342  15368  0.384     1     1     1
x[13] 0.981   0.012 1.381 0.004 0.444 3.834 13548 0.339 13547 15368  15371  0.384     1     1     1
x[14] 0.979   0.011 1.375 0.004 0.447 3.750 14547 0.364 14565 16418  16461  0.412     1     1     1
x[15] 0.983   0.011 1.384 0.004 0.451 3.766 15417 0.385 15414 16652  16655  0.416     1     1     1
x[16] 1.012   0.011 1.411 0.004 0.468 3.862 15551 0.389 15561 16390  16400  0.410     1     1     1
      zsRhat zfsRhat zfsseff zfsreff tailseff tailreff medsseff medsreff madsseff madsreff
x[1]       1       1   18445   0.461    18225    0.456    19172    0.479    23268    0.582
x[2]       1       1   19699   0.492    19392    0.485    19309    0.483    24908    0.623
x[3]       1       1   18532   0.463    18342    0.459    18694    0.467    23865    0.597
x[4]       1       1   18983   0.475    18288    0.457    19156    0.479    23907    0.598
x[5]       1       1   19535   0.488    18672    0.467    20102    0.503    25043    0.626
x[6]       1       1   17535   0.438    17525    0.438    19596    0.490    22613    0.565
x[7]       1       1   19019   0.475    19120    0.478    19555    0.489    23336    0.583
x[8]       1       1   18920   0.473    18152    0.454    18816    0.470    24195    0.605
x[9]       1       1   18386   0.460    18049    0.451    18674    0.467    22198    0.555
x[10]      1       1   18570   0.464    18098    0.452    19147    0.479    23900    0.598
x[11]      1       1   19482   0.487    19463    0.487    18393    0.460    24709    0.618
x[12]      1       1   18691   0.467    17674    0.442    18588    0.465    23963    0.599
x[13]      1       1   18506   0.463    16755    0.419    18258    0.456    24132    0.603
x[14]      1       1   18823   0.471    17715    0.443    19062    0.477    23428    0.586
x[15]      1       1   19633   0.491    19241    0.481    19620    0.490    24476    0.612
x[16]      1       1   20083   0.502    19741    0.494    18829    0.471    24441    0.611

The mean of the bulk-ESS values for \(x_j^2\) is \(1.6291 \times 10^{4}\), which is quite close to the bulk-ESS for lp__. This is not that surprising, as the potential energy in the normal model is proportional to \(\sum_{j=1}^J x_j^2\).

Let’s check the effective sample size in different parts of the posterior by computing the effective sample size for small interval probability estimates for x[1]^2.

plot_local_ess(fit = samp_x2, par = 1, nalpha = 100)

The relative efficiency is mostly a bit below 1, but it drops clearly in the right tail of \(x_1^2\). This is likely because only some iterations have high enough total energy to obtain draws from the high-energy part of the tail. Let’s check the effective sample size for quantiles.

plot_quantile_ess(fit = samp_x2, par = 1, nalpha = 100)

We can see the correlation between lp__ and the magnitude of x[1] in the following plot.

samp <- as.array(fit_n)
qplot(
  as.vector(samp[, , "lp__"]),
  abs(as.vector(samp[, , "x[1]"]))
) + 
  labs(x = 'lp__', y = 'x[1]')

Low lp__ values correspond to high energy and more variation in x[1], while high lp__ values correspond to low energy and little variation in x[1]. Finally, \(\sum_{j=1}^J x_j^2\) is perfectly (negatively) correlated with lp__.

qplot(
  as.vector(samp[, , "lp__"]),
  as.vector(apply(samp[, , 1:16]^2, 1:2, sum))
) + 
  labs(x = 'lp__', y = 'sum(x^2)')
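The perfect correlation has a simple analytic explanation: for the standard normal target, lp__ records the unnormalized log density,

```latex
\texttt{lp\_\_} = \log p(x) + \mathrm{const}
              = -\frac{1}{2}\sum_{j=1}^{J} x_j^2 + \mathrm{const},
```

so \(\sum_{j=1}^J x_j^2\) is a decreasing linear function of lp__.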

This shows that even if we obtain high effective sample size estimates for central quantities (such as the mean or median), it is important to also look at the relative efficiency of scale and tail quantities. The effective sample size of lp__ can likewise indicate problems with sampling in the tails.

Original Computing Environment

makevars <- file.path(Sys.getenv("HOME"), ".R/Makevars")
if (file.exists(makevars)) {
  writeLines(readLines(makevars)) 
}
CXX14FLAGS=-O3 -Wno-unused-variable -Wno-unused-function
CXX14 = $(BINPREF)g++ -m$(WIN) -std=c++1y
CXX11FLAGS=-O3 -Wno-unused-variable -Wno-unused-function
devtools::session_info("rstan")
- Session info -----------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.5.1 (2018-07-02)
 os       Windows 10 x64              
 system   x86_64, mingw32             
 ui       RTerm                       
 language (EN)                        
 collate  German_Germany.1252         
 ctype    German_Germany.1252         
 tz       Europe/Berlin               
 date     2018-12-19                  

- Packages ---------------------------------------------------------------------------------------
 package      * version    date       lib source                          
 assertthat     0.2.0      2017-04-11 [1] CRAN (R 3.5.0)                  
 backports      1.1.3      2018-12-14 [1] CRAN (R 3.5.1)                  
 BH             1.66.0-1   2018-02-13 [1] CRAN (R 3.5.0)                  
 callr          3.1.0      2018-12-10 [1] CRAN (R 3.5.1)                  
 cli            1.0.1      2018-09-25 [1] CRAN (R 3.5.1)                  
 colorspace     1.3-2      2016-12-14 [1] CRAN (R 3.5.0)                  
 crayon         1.3.4      2017-09-16 [1] CRAN (R 3.5.0)                  
 desc           1.2.0      2018-05-01 [1] CRAN (R 3.5.0)                  
 digest         0.6.18     2018-10-10 [1] CRAN (R 3.5.1)                  
 fansi          0.4.0      2018-10-05 [1] CRAN (R 3.5.1)                  
 ggplot2      * 3.1.0      2018-10-25 [1] CRAN (R 3.5.1)                  
 glue           1.3.0      2018-07-17 [1] CRAN (R 3.5.1)                  
 gridExtra    * 2.3        2017-09-09 [1] CRAN (R 3.5.0)                  
 gtable         0.2.0      2016-02-26 [1] CRAN (R 3.5.0)                  
 inline         0.3.15     2018-05-18 [1] CRAN (R 3.5.1)                  
 labeling       0.3        2014-08-23 [1] CRAN (R 3.5.0)                  
 lattice        0.20-35    2017-03-25 [2] CRAN (R 3.5.1)                  
 lazyeval       0.2.1      2017-10-29 [1] CRAN (R 3.5.0)                  
 loo            2.0.0      2018-04-11 [1] CRAN (R 3.5.1)                  
 magrittr       1.5        2014-11-22 [1] CRAN (R 3.5.0)                  
 MASS           7.3-50     2018-04-30 [2] CRAN (R 3.5.1)                  
 Matrix         1.2-14     2018-04-13 [2] CRAN (R 3.5.1)                  
 matrixStats    0.54.0     2018-07-23 [1] CRAN (R 3.5.1)                  
 mgcv           1.8-26     2018-11-21 [1] CRAN (R 3.5.1)                  
 munsell        0.5.0      2018-06-12 [1] CRAN (R 3.5.1)                  
 nlme           3.1-137    2018-04-07 [2] CRAN (R 3.5.1)                  
 pillar         1.3.1      2018-12-15 [1] CRAN (R 3.5.1)                  
 pkgbuild       1.0.2      2018-10-16 [1] CRAN (R 3.5.1)                  
 plyr           1.8.4      2016-06-08 [1] CRAN (R 3.5.0)                  
 prettyunits    1.0.2      2015-07-13 [1] CRAN (R 3.5.1)                  
 processx       3.2.1      2018-12-05 [1] CRAN (R 3.5.1)                  
 ps             1.2.1      2018-11-06 [1] CRAN (R 3.5.1)                  
 R6             2.3.0      2018-10-04 [1] CRAN (R 3.5.1)                  
 RColorBrewer   1.1-2      2014-12-07 [1] CRAN (R 3.5.0)                  
 Rcpp           1.0.0      2018-11-07 [1] CRAN (R 3.5.1)                  
 RcppEigen      0.3.3.5.0  2018-11-24 [1] CRAN (R 3.5.1)                  
 reshape2       1.4.3      2017-12-11 [1] CRAN (R 3.5.0)                  
 rlang          0.3.0.1    2018-10-25 [1] CRAN (R 3.5.1)                  
 rprojroot      1.3-2      2018-01-03 [1] CRAN (R 3.5.0)                  
 rstan        * 2.18.2     2018-11-07 [1] CRAN (R 3.5.1)                  
 scales         1.0.0      2018-08-09 [1] CRAN (R 3.5.1)                  
 StanHeaders  * 2.18.0-1   2018-12-13 [1] CRAN (R 3.5.1)                  
 stringi        1.2.4      2018-07-20 [1] CRAN (R 3.5.1)                  
 stringr      * 1.3.1      2018-05-10 [1] CRAN (R 3.5.1)                  
 tibble       * 1.4.2      2018-01-22 [1] CRAN (R 3.5.0)                  
 utf8           1.1.4      2018-05-24 [1] CRAN (R 3.5.1)                  
 viridisLite    0.3.0      2018-02-01 [1] CRAN (R 3.5.0)                  
 withr          2.1.2.9000 2018-12-18 [1] Github (jimhester/withr@be57595)

[1] C:/Users/paulb/Documents/R/win-library/3.5
[2] C:/Program Files/R/R-3.5.1/library